The ACPI SPCR code has been used to enable console output for ARM64 and
X86. The same code can be reused for RISC-V. Furthermore, SPCR table is
mandated for headless system as outlined in the RISC-V BRS
Specification, chapter 6.
Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com>
Reviewed-by: Sunil V L <sunilvl@ventanamicro.com>
Link: https://lore.kernel.org/r/20240502073751.102093-2-jeeheng.sia@starfivetech.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Since commit 7caa976546 ("ftrace: riscv: move from REGS to ARGS"),
kprobe on ftrace is not supported by riscv, because riscv's support for
FTRACE_WITH_REGS has been replaced with support for FTRACE_WITH_ARGS, and
KPROBES_ON_FTRACE will be supplanted by FPROBES. So remove the deprecated
kprobe on ftrace support, which is misunderstood.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240613111347.1745379-1-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Charlie Jenkins <charlie@rivosinc.com> says:
All extensions, both standard and vendor, live in one struct
"riscv_isa_ext". There is currently one vendor extension, xandespmu, but
it is likely that more vendor extensions will be added to the kernel in
the future. As more vendor extensions (and standard extensions) are
added, riscv_isa_ext will become more bloated with a mix of vendor and
standard extensions.
This also allows each vendor to be conditionally enabled through
Kconfig.
* b4-shazam-merge:
riscv: cpufeature: Extract common elements from extension checking
riscv: Introduce vendor variants of extension helpers
riscv: Add vendor extensions to /proc/cpuinfo
riscv: Extend cpufeature.c to detect vendor extensions
Link: https://lore.kernel.org/r/20240719-support_vendor_extensions-v3-0-0af7587bbec0@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
All of the supported vendor extensions that have been listed in
riscv_isa_vendor_ext_list can be exported through /proc/cpuinfo.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Evan Green <evan@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240719-support_vendor_extensions-v3-2-0af7587bbec0@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Instead of grouping all vendor extensions into the same riscv_isa_ext
that standard instructions use, create a struct
"riscv_isa_vendor_ext_data_list" that allows each vendor to maintain
their vendor extensions independently of the standard extensions.
xandespmu is currently the only vendor extension so that is the only
extension that is affected by this change.
An additional benefit of this is that the extensions of each vendor can
be conditionally enabled. A config RISCV_ISA_VENDOR_EXT_ANDES has been
added to allow for that.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andy Chiu <andy.chiu@sifive.com>
Tested-by: Yu Chien Peter Lin <peterlin@andestech.com>
Reviewed-by: Yu Chien Peter Lin <peterlin@andestech.com>
Link: https://lore.kernel.org/r/20240719-support_vendor_extensions-v3-1-0af7587bbec0@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Currently the entries appear to be in a random order (although according
to Palmer he has tried to sort them by key value) which makes it harder
to find entries in a growing list, and more likely to have conflicts as
all patches are adding to the end of the list. Sort them alphabetically
instead.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Clément Léger <cleger@rivosinc.com>
Link: https://lore.kernel.org/r/20240717-dedicate-squeamish-7e4ab54df58f@spud
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Haibo Xu <haibo1.xu@intel.com> says:
This patch series enable RISC-V ACPI NUMA support which was based on
the recently approved ACPI ECR[1].
Patch 1/4 add RISC-V specific acpi_numa.c file to parse NUMA information
from SRAT and SLIT ACPI tables.
Patch 2/4 add the common SRAT RINTC affinity structure handler.
Patch 3/4 change the ACPI_NUMA to a hidden option since it would be selected
by default on all supported platform.
Patch 4/4 replace pr_info with pr_debug in arch_acpi_numa_init() to avoid
potential boot noise on ACPI platforms that are not NUMA.
Based-on: https://github.com/linux-riscv/linux-riscv/tree/for-next
[1] https://drive.google.com/file/d/1YTdDx2IPm5IeZjAW932EYU-tUtgS08tX/view?usp=sharing
Testing:
Since the ACPI AIA/PLIC support patch set is still under upstream review,
hence it is tested using the poll based HVC SBI console and RAM disk.
1) Build latest Qemu with the following patch backported
42bd4eeefd
2) Build latest EDK-II
https://github.com/tianocore/edk2/blob/master/OvmfPkg/RiscVVirt/README.md
3) Build Linux with the following configs enabled
CONFIG_RISCV_SBI_V01=y
CONFIG_SERIAL_EARLYCON_RISCV_SBI=y
CONFIG_NONPORTABLE=y
CONFIG_HVC_RISCV_SBI=y
CONFIG_NUMA=y
CONFIG_ACPI_NUMA=y
4) Build buildroot rootfs.cpio
5) Launch the Qemu machine
qemu-system-riscv64 -nographic \
-machine virt,pflash0=pflash0,pflash1=pflash1 -smp 4 -m 8G \
-blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd \
-blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd \
-object memory-backend-ram,size=4G,id=m0 \
-object memory-backend-ram,size=4G,id=m1 \
-numa node,memdev=m0,cpus=0-1,nodeid=0 \
-numa node,memdev=m1,cpus=2-3,nodeid=1 \
-numa dist,src=0,dst=1,val=30 \
-kernel linux/arch/riscv/boot/Image \
-initrd buildroot/output/images/rootfs.cpio \
-append "root=/dev/ram ro console=hvc0 earlycon=sbi"
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x80000000-0x17fffffff]
[ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x180000000-0x27fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x17fe3bc40-0x17fe3cfff]
[ 0.000000] NUMA: NODE_DATA [mem 0x27fff4c40-0x27fff5fff]
...
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> HARTID 0x0 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> HARTID 0x1 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> HARTID 0x2 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> HARTID 0x3 -> Node 1
* b4-shazam-merge:
ACPI: NUMA: replace pr_info with pr_debug in arch_acpi_numa_init
ACPI: NUMA: change the ACPI_NUMA to a hidden option
ACPI: NUMA: Add handler for SRAT RINTC affinity structure
ACPI: RISCV: Add NUMA support based on SRAT and SLIT
Link: https://lore.kernel.org/r/cover.1718268003.git.haibo1.xu@intel.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Add acpi_numa.c file to enable parse NUMA information from
ACPI SRAT and SLIT tables. SRAT table provide CPUs(Hart) and
memory nodes to proximity domain mapping, while SLIT table
provide the distance metrics between proximity domains.
Signed-off-by: Haibo Xu <haibo1.xu@intel.com>
Reviewed-by: Sunil V L <sunilvl@ventanamicro.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Link: https://lore.kernel.org/r/65dbad1fda08a32922c44886e4581e49b4a2fecc.1718268003.git.haibo1.xu@intel.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
* Support for various new ISA extensions:
* The Zve32[xf] and Zve64[xfd] sub-extensios of the vector
extension.
* Zimop and Zcmop for may-be-operations.
* The Zca, Zcf, Zcd and Zcb sub-extensions of the C extension.
* Zawrs,
* riscv,cpu-intc is now dtschema.
* A handful of performance improvements and cleanups to text patching.
* Support for memory hot{,un}plug
* The highest user-allocatable virtual address is now visible in
hwprobe.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmabIGETHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiQe8D/9QPCaOnoP5OCZbwjkRBwaVxyknNyD0
l+YNXk7Jk3B/oaOv3d7Bz+uWt1SG4j4jkfyuGJ81StZykp4/R7T823TZrPhog9VX
IJm580MtvE49I2i1qJ+ZQti9wpiM+80lFnyMCzY6S7rrM9m62tKgUpARZcWoA55P
iUo5bku99TYCcU2k1pnPrNSPQvVpECpv7tG0PwKpQd5DiYjbPp+aw5cQWN+izdOB
6raOZ0buzP7McszvO/gcJs+kuHwrp0JSRvNxc2pwYZ0lx00p3hSV8UdtIMlI9Qm/
z3gkQGHwc6UVMPHo1x0Gr5ShUTCI/iSwy4/7aY4NNXF6Sj99b8alt9GcbYqNAE7V
k7sibCR7dhL4ods/GFMmzR7cQYlwlwtO+/ILak7rXhNvA32Xy1WUABguhP9ElTmw
1ZS2hnRv6wc7MA2V7HBamf5mPXM6HQyC3oKy3njzDSJdiGIG7aa+TOfRAD+L/1Du
QjIrKp6XcPIsZNjh8H3nMDVJ0VvDNnS4d4LbfNQc23VPzf57kFUqbli1pS0hBjFT
ELEItH9dgSx+T5Qebdy/QMC3RG8Yc1IUdw6VQ7Jny/uCCEZNq+VZ+bXxspMmswCp
sUIyDplJTJfRt3G2OxK0b95x6oj8jbaJOQfv6PBF71dDBsChg8eXFVJ2NDrX4Bvr
h2MPK7vGBtFz8w==
=+ICi
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.11-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- Support for various new ISA extensions:
* The Zve32[xf] and Zve64[xfd] sub-extensios of the vector
extension
* Zimop and Zcmop for may-be-operations
* The Zca, Zcf, Zcd and Zcb sub-extensions of the C extension
* Zawrs
- riscv,cpu-intc is now dtschema
- A handful of performance improvements and cleanups to text patching
- Support for memory hot{,un}plug
- The highest user-allocatable virtual address is now visible in
hwprobe
* tag 'riscv-for-linus-6.11-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (58 commits)
riscv: lib: relax assembly constraints in hweight
riscv: set trap vector earlier
KVM: riscv: selftests: Add Zawrs extension to get-reg-list test
KVM: riscv: Support guest wrs.nto
riscv: hwprobe: export Zawrs ISA extension
riscv: Add Zawrs support for spinlocks
dt-bindings: riscv: Add Zawrs ISA extension description
riscv: Provide a definition for 'pause'
riscv: hwprobe: export highest virtual userspace address
riscv: Improve sbi_ecall() code generation by reordering arguments
riscv: Add tracepoints for SBI calls and returns
riscv: Optimize crc32 with Zbc extension
riscv: Enable DAX VMEMMAP optimization
riscv: mm: Add support for ZONE_DEVICE
virtio-mem: Enable virtio-mem for RISC-V
riscv: Enable memory hotplugging for RISC-V
riscv: mm: Take memory hotplug read-lock during kernel page table dump
riscv: mm: Add memory hotplugging support
riscv: mm: Add pfn_to_kaddr() implementation
riscv: mm: Refactor create_linear_mapping_range() for memory hot add
...
Most of this is part of my ongoing work to clean up the system call
tables. In this bit, all of the newer architectures are converted to
use the machine readable syscall.tbl format instead in place of complex
macros in include/uapi/asm-generic/unistd.h.
This follows an earlier series that fixed various API mismatches
and in turn is used as the base for planned simplifications.
The other two patches are dead code removal and a warning fix.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmaVB1cACgkQYKtH/8kJ
UicMqxAAnYKOxfjoMIhYYK6bl126wg/vIcDcjIR9cNWH21Nhn3qxn11ZXau3S7xv
3l/HreEhyEQr4gC2a70IlXyHUadYOlrk+83OURrunWk1oKPmZlMKcfPVbtp8GL7x
PUNXQfwM1XZLveKwufY24hoZdwKC+Y/5WLc1t0ReznJuAqgeO2rM9W5dnV5bAfCp
he3F5hFcr196Dz3/GJjJIWrY+cbwfmZWsNtj1vFTL5/r/LuCu8HTkqhsGj8tE5BJ
NGVEEXbp5eaVTCIGqJWhnuZcsnKN9kM51M7CtdwWf8OTckUVuJap5OsDVKQkWkGl
bLPbd2jhDltph0sah51hAIvv4WdkThW76u9FRW7KR3fo7ra67eF7l5j7wc1lE2JB
GwLJ1X56Bxe1GhvvNTlDmb7DrnlP/DMPuRv3Z6xyH6l8iZ2pMGlnAxuw6Bs1s6Y5
WSs36ZpnS0ctgjfx37ZITsZSvbKFPpQFJP4siwS8aRNv/NFALNNdFyOCY5lNzspZ
0dxwjn6/7UpHE4MKh6/hvCg2QwupXXBTRytibw+75/rOsR+EYlmtuONtyq2sLUHe
ktJ5pg+8XuZm27+wLffuluzmY7sv2F8OU4cTYeM60Ynmc6pRzwUY6/VhG52S1/mU
Ua4VgYIpzOtlLrYmz5QTWIZpdSFSVbIc/3pLriD6hn4Mvg+BwdA=
=XOhL
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"Most of this is part of my ongoing work to clean up the system call
tables. In this bit, all of the newer architectures are converted to
use the machine readable syscall.tbl format instead in place of
complex macros in include/uapi/asm-generic/unistd.h.
This follows an earlier series that fixed various API mismatches and
in turn is used as the base for planned simplifications.
The other two patches are dead code removal and a warning fix"
* tag 'asm-generic-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
vmlinux.lds.h: catch .bss..L* sections into BSS")
fixmap: Remove unused set_fixmap_offset_io()
riscv: convert to generic syscall table
openrisc: convert to generic syscall table
nios2: convert to generic syscall table
loongarch: convert to generic syscall table
hexagon: use new system call table
csky: convert to generic syscall table
arm64: rework compat syscall macros
arm64: generate 64-bit syscall.tbl
arm64: convert unistd_32.h to syscall.tbl format
arc: convert to generic syscall table
clone3: drop __ARCH_WANT_SYS_CLONE3 macro
kbuild: add syscall table generation to scripts/Makefile.asm-headers
kbuild: verify asm-generic header list
loongarch: avoid generating extra header files
um: don't generate asm/bpf_perf_event.h
csky: drop asm/gpio.h wrapper
syscalls: add generic scripts/syscall.tbl
The exception vector of the booting hart is not set before enabling
the mmu and then still points to the value of the previous firmware,
typically _start. That makes it hard to debug setup_vm() when bad
things happen. So fix that by setting the exception vector earlier.
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: yang.zhang <yang.zhang@hexintek.com>
Link: https://lore.kernel.org/r/20240508022445.6131-1-gaoshanliukou@163.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Andrew Jones <ajones@ventanamicro.com> says:
Zawrs provides two instructions (wrs.nto and wrs.sto), where both are
meant to allow the hart to enter a low-power state while waiting on a
store to a memory location. The instructions also both wait an
implementation-defined "short" duration (unless the implementation
terminates the stall for another reason). The difference is that while
wrs.sto will terminate when the duration elapses, wrs.nto, depending on
configuration, will either just keep waiting or an ILL exception will be
raised. Linux will use wrs.nto, so if platforms have an implementation
which falls in the "just keep waiting" category (which is not expected),
then it should _not_ advertise Zawrs in the hardware description.
Like wfi (and with the same {m,h}status bits to configure it), when
wrs.nto is configured to raise exceptions it's expected that the higher
privilege level will see the instruction was a wait instruction, do
something, and then resume execution following the instruction. For
example, KVM does configure exceptions for wfi (hstatus.VTW=1) and
therefore also for wrs.nto. KVM does this for wfi since it's better to
allow other tasks to be scheduled while a VCPU waits for an interrupt.
For waits such as those where wrs.nto/sto would be used, which are
typically locks, it is also a good idea for KVM to be involved, as it
can attempt to schedule the lock holding VCPU.
This series starts with Christoph's addition of the riscv
smp_cond_load_relaxed function which applies wrs.sto when available.
That patch has been reworked to use wrs.nto and to use the same approach
as Arm for the wait loop, since we can't have arbitrary C code between
the load-reserved and the wrs. Then, hwprobe support is added (since the
instructions are also usable from usermode), and finally KVM is
taught about wrs.nto, allowing guests to see and use the Zawrs
extension.
We still don't have test results from hardware, and it's not possible to
prove that using Zawrs is a win when testing on QEMU, not even when
oversubscribing VCPUs to guests. However, it is possible to use KVM
selftests to force a scenario where we can prove Zawrs does its job and
does it well. [4] is a test which does this and, on my machine, without
Zawrs it takes 16 seconds to complete and with Zawrs it takes 0.25
seconds.
This series is also available here [1]. In order to use QEMU for testing
a build with [2] is needed. In order to enable guests to use Zawrs with
KVM using kvmtool, the branch at [3] may be used.
[1] https://github.com/jones-drew/linux/commits/riscv/zawrs-v3/
[2] https://lore.kernel.org/all/20240312152901.512001-2-ajones@ventanamicro.com/
[3] https://github.com/jones-drew/kvmtool/commits/riscv/zawrs/
[4] cb2beccebc
Link: https://lore.kernel.org/r/20240426100820.14762-8-ajones@ventanamicro.com
* b4-shazam-merge:
KVM: riscv: selftests: Add Zawrs extension to get-reg-list test
KVM: riscv: Support guest wrs.nto
riscv: hwprobe: export Zawrs ISA extension
riscv: Add Zawrs support for spinlocks
dt-bindings: riscv: Add Zawrs ISA extension description
riscv: Provide a definition for 'pause'
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Export Zawrs ISA extension through hwprobe.
[Palmer: there's a gap in the numbers here as there will be a merge
conflict when this is picked up. To avoid confusion I just set the
hwprobe ID to match what it would be post-merge.]
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Clément Léger <cleger@rivosinc.com>
Link: https://lore.kernel.org/r/20240426100820.14762-12-ajones@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
RISC-V code uses the generic ticket lock implementation, which calls
the macros smp_cond_load_relaxed() and smp_cond_load_acquire().
Introduce a RISC-V specific implementation of smp_cond_load_relaxed()
which applies WRS.NTO of the Zawrs extension in order to reduce power
consumption while waiting and allows hypervisors to enable guests to
trap while waiting. smp_cond_load_acquire() doesn't need a RISC-V
specific implementation as the generic implementation is based on
smp_cond_load_relaxed() and smp_acquire__after_ctrl_dep() sufficiently
provides the acquire semantics.
This implementation is heavily based on Arm's approach which is the
approach Andrea Parri also suggested.
The Zawrs specification can be found here:
https://github.com/riscv/riscv-zawrs/blob/main/zawrs.adoc
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
Co-developed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20240426100820.14762-11-ajones@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Some userspace applications (OpenJDK for instance) uses the free MSBs
in pointers to insert additional information for their own logic and
need to get this information from somewhere. Currently they rely on
parsing /proc/cpuinfo "mmu=svxx" string to obtain the current value of
virtual address usable bits [1]. Since this reflect the raw supported
MMU mode, it might differ from the logical one used internally which is
why arch_get_mmap_end() is used. Exporting the highest mmapable address
through hwprobe will allow a more stable interface to be used. For that
purpose, add a new hwprobe key named
RISCV_HWPROBE_KEY_HIGHEST_VIRT_ADDRESS which will export the highest
userspace virtual address.
Link: https://github.com/openjdk/jdk/blob/master/src/hotspot/os_cpu/linux_riscv/vm_version_linux_riscv.cpp#L171 [1]
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240410144558.1104006-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The sbi_ecall() function arguments are not in the same order as the
ecall arguments, so we end up re-ordering the registers before the
ecall which is useless and costly.
So simply reorder the arguments in the same way as expected by ecall.
Instead of reordering directly the arguments of sbi_ecall(), use a proxy
macro since the current ordering is more natural.
Before:
Dump of assembler code for function sbi_ecall:
0xffffffff800085e0 <+0>: add sp,sp,-32
0xffffffff800085e2 <+2>: sd s0,24(sp)
0xffffffff800085e4 <+4>: mv t1,a0
0xffffffff800085e6 <+6>: add s0,sp,32
0xffffffff800085e8 <+8>: mv t3,a1
0xffffffff800085ea <+10>: mv a0,a2
0xffffffff800085ec <+12>: mv a1,a3
0xffffffff800085ee <+14>: mv a2,a4
0xffffffff800085f0 <+16>: mv a3,a5
0xffffffff800085f2 <+18>: mv a4,a6
0xffffffff800085f4 <+20>: mv a5,a7
0xffffffff800085f6 <+22>: mv a6,t3
0xffffffff800085f8 <+24>: mv a7,t1
0xffffffff800085fa <+26>: ecall
0xffffffff800085fe <+30>: ld s0,24(sp)
0xffffffff80008600 <+32>: add sp,sp,32
0xffffffff80008602 <+34>: ret
After:
Dump of assembler code for function __sbi_ecall:
0xffffffff8000b6b2 <+0>: add sp,sp,-32
0xffffffff8000b6b4 <+2>: sd s0,24(sp)
0xffffffff8000b6b6 <+4>: add s0,sp,32
0xffffffff8000b6b8 <+6>: ecall
0xffffffff8000b6bc <+10>: ld s0,24(sp)
0xffffffff8000b6be <+12>: add sp,sp,32
0xffffffff8000b6c0 <+14>: ret
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20240322112629.68170-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
These are useful for measuring the latency of SBI calls. The SBI HSM
extension is excluded because those functions are called from contexts
such as cpuidle where instrumentation is not allowed.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240321230131.1838105-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The uapi/asm/unistd_{32,64}.h and asm/syscall_table_{32,64}.h headers can
now be generated from scripts/syscall.tbl, which makes this consistent
with the other architectures that have their own syscall.tbl.
riscv has two extra system call that gets added to scripts/syscall.tbl.
The newstat and rlimit entries in the syscall_abis_64 line are for system
calls that were part of the generic ABI when riscv64 got added but are
no longer enabled by default for new architectures. Both riscv32 and
riscv64 also implement memfd_secret, which is optional for all
architectures.
Unlike all the other 32-bit architectures, the time32 and stat64
sets of syscalls are not enabled on riscv32.
Both the user visible side of asm/unistd.h and the internal syscall
table in the kernel should have the same effective contents after this.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
If the kexec crash code is called in the interrupt context, the
machine_kexec_mask_interrupts() function will trigger a deadlock while
trying to acquire the irqdesc spinlock and then deactivate irqchip in
irq_set_irqchip_state() function.
Unlike arm64, riscv only requires irq_eoi handler to complete EOI and
keeping irq_set_irqchip_state() will only leave this possible deadlock
without any use. So we simply remove it.
Link: https://lore.kernel.org/linux-riscv/20231208111015.173237-1-songshuaishuai@tinylab.org/
Fixes: b17d19a531 ("riscv: kexec: Fixup irq controller broken in kexec crash path")
Signed-off-by: Song Shuai <songshuaishuai@tinylab.org>
Reviewed-by: Ryo Takakura <takakura@valinux.co.jp>
Link: https://lore.kernel.org/r/20240626023316.539971-1-songshuaishuai@tinylab.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
ftrace_graph_ret_addr() takes an `idx` integer pointer that is used to
optimize the stack unwinding. Pass it a valid pointer to utilize the
optimizations that might be available in the future.
The commit is making riscv's usage of ftrace_graph_ret_addr() match
x86_64.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20240618145820.62112-1-puranjay@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
* A fix for vector load/store instruction decoding, which could result
in reserved vector element length encodings decoding as valid vector
instructions.
* Instruction patching now aggressively flushes the local instruction
cache, to avoid situations where patching functions on the flush path
results in torn instructions being fetched.
* A fix to prevent the stack walker from showing up as part of traces.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmZ+4zUTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiTXxD/wPSWbHf24Mr4CrFYbKR7lHWjGku+jG
8LQa+B9uUgpA8XjNjeeECg7lsJq/1avbPrUlValRckUIMZPHSWK4U7aFkkPs1WFa
87D5pA4AVkt5U8v/3c5GQ8Tod0Afa7OyFggxdglC3XFvUa5TNdn3pdv0rdE4Mx5l
QRijFyLlhRv/D5We+exNAVJmkdHfSXQEEyEjXeb83VK+PsZzAXvHLj3omxCyQ1kH
Kt2RyN8QpkUisFNTpSvPHiuoPjUeJvRs0JIyrO1SwBGHyYs4kg6g0KBk4YjTTXG5
xbRVG2tMO9TS2jRRHJS7fdI+yF0X1z+t+WUG1F1WHvKxnqbWczJ0UczFgacaVSkw
BM/yO+VPg22f6bM4H85K5GBdhN8PplFDuHDdVQ8/LDGOrQKaByrorXWq3WrbwJcq
vVwbnBGW6v4we5COzyHvwnakzl4bEMHoUb2NVTzZM+fFleiEdx4Sg+F5Us8+UlZh
PztwjPao8spIm81l/wXStxYzschDyonCt74/odic2LDtFBirZWzDdUInXFVzUZs7
CUxF38XJ6SNQJBVVwQv6qisoWhy6Ca9SGKwY2GwW7Ustx3C4Eh0nrOVmI/DcHRgN
9rGm13Qfm8eUSznTM+buWTTluZvtmZupGpAP2GhvaUgTMDIfK/vttidW9Kf4KsP8
hn+jllIc1WgE6Q==
=0bpZ
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for vector load/store instruction decoding, which could result
in reserved vector element length encodings decoding as valid vector
instructions.
- Instruction patching now aggressively flushes the local instruction
cache, to avoid situations where patching functions on the flush path
results in torn instructions being fetched.
- A fix to prevent the stack walker from showing up as part of traces.
* tag 'riscv-for-linus-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: stacktrace: convert arch_stack_walk() to noinstr
riscv: patch: Flush the icache right after patching to avoid illegal insns
RISC-V: fix vector insn load/store width mask
Clément Léger <cleger@rivosinc.com> says:
Add support for (yet again) more RVA23U64 missing extensions. Add
support for Zimop, Zcmop, Zca, Zcf, Zcd and Zcb extensions ISA string
parsing, hwprobe and kvm support. Zce, Zcmt and Zcmp extensions have
been left out since they target microcontrollers/embedded CPUs and are
not needed by RVA23U64.
Since Zc* extensions states that C implies Zca, Zcf (if F and RV32), Zcd
(if D), this series modifies the way ISA string is parsed and now does
it in two phases. First one parses the string and the second one
validates it for the final ISA description.
* b4-shazam-merge:
KVM: riscv: selftests: Add Zcmop extension to get-reg-list test
RISC-V: KVM: Allow Zcmop extension for Guest/VM
riscv: hwprobe: export Zcmop ISA extension
riscv: add ISA extension parsing for Zcmop
dt-bindings: riscv: add Zcmop ISA extension description
KVM: riscv: selftests: Add some Zc* extensions to get-reg-list test
RISC-V: KVM: Allow Zca, Zcf, Zcd and Zcb extensions for Guest/VM
riscv: hwprobe: export Zca, Zcf, Zcd and Zcb ISA extensions
riscv: add ISA parsing for Zca, Zcf, Zcd and Zcb
riscv: add ISA extensions validation callback
dt-bindings: riscv: add Zca, Zcf, Zcd and Zcb ISA extension description
KVM: riscv: selftests: Add Zimop extension to get-reg-list test
RISC-V: KVM: Allow Zimop extension for Guest/VM
riscv: hwprobe: export Zimop ISA extension
riscv: add ISA extension parsing for Zimop
dt-bindings: riscv: add Zimop ISA extension description
Link: https://lore.kernel.org/r/20240619113529.676940-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Add parsing for Zcmop ISA extension which was ratified in commit
c732a4f39a4c ("Zcmop is ratified/1.0") of the riscv-isa-manual.
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240619113529.676940-14-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The Zc* standard extension for code reduction introduces new extensions.
This patch adds support for Zca, Zcf, Zcd and Zcb. Zce, Zcmt and Zcmp
are left out of this patch since they are targeting microcontrollers/
embedded CPUs instead of application processors.
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240619113529.676940-9-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Since a few extensions (Zicbom/Zicboz) already needs validation and
future ones will need it as well (Zc*) add a validate() callback to
struct riscv_isa_ext_data. This require to rework the way extensions are
parsed and split it in two phases. First phase is isa string or isa
extension list parsing and consists in enabling all the extensions in a
temporary bitmask (source isa) without any validation. The second step
"resolves" the final isa bitmap, handling potential missing dependencies.
The mechanism is quite simple and simply validate each extension
described in the source bitmap before enabling it in the resolved isa
bitmap. validate() callbacks can return either 0 for success,
-EPROBEDEFER if extension needs to be validated again at next loop. A
previous ISA bitmap is kept to avoid looping multiple times if an
extension dependencies are never satisfied until we reach a stable
state. In order to avoid any potential infinite looping, allow looping
a maximum of the number of extension we handle. Zicboz and Zicbom
extensions are modified to use this validation mechanism.
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240619113529.676940-8-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Add parsing for Zimop ISA extension which was ratified in commit
58220614a5f of the riscv-isa-manual.
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240619113529.676940-3-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This first patch in the larger series is a fix, so I'm merging it into
fixes while the rest of the patch set is still under development.
* b4-shazam-merge:
riscv: stacktrace: convert arch_stack_walk() to noinstr
Link: https://lore.kernel.org/r/20240613-dev-andyc-dyn-ftrace-v4-v1-0-1a538e12c01e@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
arch_stack_walk() is called intensively in function_graph when the
kernel is compiled with CONFIG_TRACE_IRQFLAGS. As a result, the kernel
logs a lot of arch_stack_walk and its sub-functions into the ftrace
buffer. However, these functions should not appear on the trace log
because they are part of the ftrace itself. This patch references what
arm64 does for the smae function. So it further prevent the re-enter
kprobe issue, which is also possible on riscv.
Related-to: commit 0fbcd8abf3 ("arm64: Prohibit instrumentation on arch_stack_walk()")
Fixes: 680341382d ("riscv: add CALLER_ADDRx support")
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240613-dev-andyc-dyn-ftrace-v4-v1-1-1a538e12c01e@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
We cannot delay the icache flush after patching some functions as we may
have patched a function that will get called before the icache flush.
The only way to completely avoid such scenario is by flushing the icache
as soon as we patch a function. This will probably be costly as we don't
batch the icache maintenance anymore.
Fixes: 6ca445d8af ("riscv: Fix early ftrace nop patching")
Reported-by: Conor Dooley <conor.dooley@microchip.com>
Closes: https://lore.kernel.org/linux-riscv/20240613-lubricant-breath-061192a9489a@wendy/
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andy Chiu <andy.chiu@sifive.com>
Link: https://lore.kernel.org/r/20240624082141.153871-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Samuel Holland <samuel.holland@sifive.com> says:
Here are a few changes to minimize calls to stop_machine() and
flush_icache_*() in the various text patching functions, as well as
to simplify the code.
* b4-shazam-merge:
riscv: Remove extra variable in patch_text_nosync()
riscv: Use offset_in_page() in text patching functions
riscv: Pass patch_text() the length in bytes
riscv: Simplify text patching loops
riscv: kprobes: Use patch_text_nosync() for insn slots
riscv: jump_label: Simplify assembly syntax
riscv: jump_label: Batch icache maintenance
Link: https://lore.kernel.org/r/20240327160520.791322-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This cast is superfluous, and is incorrect anyway if compressed
instructions may be present.
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240327160520.791322-8-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This is a bit easier to parse than the equivalent bit manipulation.
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240327160520.791322-7-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
patch_text_nosync() already handles an arbitrary length of code, so this
removes a superfluous loop and reduces the number of icache flushes.
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240327160520.791322-6-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This reduces the number of variables and makes the code easier to parse.
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240327160520.791322-5-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
These instructions are not yet visible to the rest of the system,
so there is no need to do the whole stop_machine() dance.
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327160520.791322-4-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Switch to the batched version of the jump label update functions so
instruction cache maintenance is deferred until the end of the update.
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327160520.791322-2-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Most architectures that implement the old-style mmap() with byte offset
use 'unsigned long' as the type for that offset, but microblaze and
riscv have the off_t type that is shared with userspace, matching the
prototype in include/asm-generic/syscalls.h.
Make this consistent by using an unsigned argument everywhere. This
changes the behavior slightly, as the argument is shifted to a page
number, and an user input with the top bit set would result in a
negative page offset rather than a large one as we use elsewhere.
For riscv, the 32-bit sys_mmap2() definition actually used a custom
type that is different from the global declaration, but this was
missed due to an incorrect type check.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Make has_vector() to check for ZVE32X. Every in-kernel usage of V that
requires a more complicate version of V must then call out explicitly.
Also, change riscv_v_first_use_handler(), and boot code that calls
riscv_v_setup_vsize() to accept ZVE32X.
Most kernel/user interfaces requires minimum of ZVE32X. Thus, programs
compiled and run with ZVE32X should be supported by the kernel on most
aspects. This includes context-switch, signal, ptrace, prctl, and
hwprobe.
One exception is that ELF_HWCAP returns 'V' only if full V is supported
on the platform. This means that the system without a full V must not
rely on ELF_HWCAP to tell whether it is allowable to execute Vector
without first invoking a prctl() check.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Acked-by: Joel Granados <j.granados@samsung.com>
Link: https://lore.kernel.org/r/20240510-zve-detection-v5-7-0711bdd26c12@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The following Vector subextensions for "embedded" platforms are added
into RISCV_HWPROBE_KEY_IMA_EXT_0:
- ZVE32X
- ZVE32F
- ZVE64X
- ZVE64F
- ZVE64D
Extensions ending with an X indicates that the platform doesn't have a
vector FPU.
Extensions ending with F/D mean that whether single (F) or double (D)
precision vector operation is supported.
The number 32 or 64 follows from ZVE tells the maximum element length.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Clément Léger <cleger@rivosinc.com>
Link: https://lore.kernel.org/r/20240510-zve-detection-v5-6-0711bdd26c12@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Multiple Vector subextensions are added. Also, the patch takes care of
the dependencies of Vector subextensions by macro expansions. So, if
some "embedded" platform only reports "zve64f" on the ISA string, the
parser is able to expand it to zve32x zve32f zve64x and zve64f.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240510-zve-detection-v5-5-0711bdd26c12@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Single-letter extensions may also imply multiple subextensions. For
example, Vector extension implies zve64d, and zve64d implies zve64f.
Extension parsing for "riscv,isa-extensions" has the ability to resolve
the dependency by calling match_isa_ext(). This patch makes deprecated
parser call the same function for single letter extensions.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240510-zve-detection-v5-3-0711bdd26c12@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Currently we only support Vector for SMP platforms, that is, all SMP
cores have the same vlenb. If we happen to detect a mismatching vlen, it
is better to just fail bootting it up to prevent further race/scheduling
issues.
Also, move .Lsecondary_park forward and chage `tail smp_callin` into a
regular call in the early assembly. So a core would be parked right
after a return from smp_callin. Note that a successful smp_callin
does not return.
Fixes: 7017858eb2 ("riscv: Introduce riscv_v_vsize to record size of Vector context")
Reported-by: Conor Dooley <conor.dooley@microchip.com>
Closes: https://lore.kernel.org/linux-riscv/20240228-vicinity-cornstalk-4b8eb5fe5730@spud/
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20240510-zve-detection-v5-2-0711bdd26c12@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The function would fail when it detects the calling hart's vlen doesn't
match the first one's. The boot hart is the first hart calling this
function during riscv_fill_hwcap, so it is impossible to fail here. Add
a comment about this behavior.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240510-zve-detection-v5-1-0711bdd26c12@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Top of the kernel thread stack should be reserved for pt_regs. However
this is not the case for the idle threads of the secondary boot harts.
Their stacks overlap with their pt_regs, so both may get corrupted.
Similar issue has been fixed for the primary hart, see c7cdd96eca
("riscv: prevent stack corruption by reserving task_pt_regs(p) early").
However that fix was not propagated to the secondary harts. The problem
has been noticed in some CPU hotplug tests with V enabled. The function
smp_callin stored several registers on stack, corrupting top of pt_regs
structure including status field. As a result, kernel attempted to save
or restore inexistent V context.
Fixes: 9a2451f186 ("RISC-V: Avoid using per cpu array for ordered booting")
Fixes: 2875fe0561 ("RISC-V: Add cpu_ops and modify default booting method")
Signed-off-by: Sergey Matyukevich <sergey.matyukevich@syntacore.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240523084327.2013211-1-geomatsi@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
* The compression format used for boot images is now configurable at
build time, and these formats are shown in `make help`.
* access_ok() has been optimized.
* A pair of performance bugs have been fixed in the uaccess handlers.
* Various fixes and cleanups, including one for the IMSIC build failure
and one for the early-boot ftrace illegal NOPs bug.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmZQtRwTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiWPBD/0UitwMg88m6urvMd0Pfvwwbu/OnGqW
TZT8C55iJi/e5f9K4mBrSyjATI8z/MblD+Zz0adX8ygavS4JuQ7DoWwb1yTT3pww
+z74FkWeJuiar+HfbhQ602CfMrnzvWjnyJ3URemqy5pIBKyvD9gGkDJDZwf8hJTk
Vqh5qVtnBqFBO9kWpIx+/pLCfpyHVNkhWr1AzKfoqQ1WPIpZ/o0IGdvS88rL+EBR
QOXiwVhEsRfC+LT6Jhn8l2bGp7PaSRVOid19OxNsJKpAhpL6AOscaafclVrLBuTd
gkys0rT2dHdoWTAkPHQpvlOI6OmGTgopxo5pUKJHS8J9VRoBun25zC1FGBF8uyVd
05CabWPnh7olNsRge9XiNj3x8PXjGVi7X7wUbRgOBG5aDc6TbKdxu37J0tXe0M7a
Q74ctQvk8Nk6bQWirgTNlfJJHzL5pJbKc9VwY5uGX4qTmH+yEvCIt45ZXgXOuS/F
eqijStkkdXUDnkMdcpaZJvXP80rHcgfP8bqevvPymRli8ER9zj9aXJQ3rmCUcPz+
EtbyS+vOEN31wNTA1EQlfIRxfvr22x7r70DDdRwmhuD1W1tgfblm+R0Cq76I5rnJ
VSgXKq1b4mY0eautqXEnPGyqb7H8iJIq7AoyfbzzWN+4u6yVEUvpDKueeksy+fFt
sGNtjWqGhWyKXg==
=/Qtt
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.10-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull more RISC-V updates from Palmer Dabbelt:
- The compression format used for boot images is now configurable at
build time, and these formats are shown in `make help`
- access_ok() has been optimized
- A pair of performance bugs have been fixed in the uaccess handlers
- Various fixes and cleanups, including one for the IMSIC build failure
and one for the early-boot ftrace illegal NOPs bug
* tag 'riscv-for-linus-6.10-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix early ftrace nop patching
irqchip: riscv-imsic: Fixup riscv_ipi_set_virq_range() conflict
riscv: selftests: Add signal handling vector tests
riscv: mm: accelerate pagefault when badaccess
riscv: uaccess: Relax the threshold for fast path
riscv: uaccess: Allow the last potential unrolled copy
riscv: typo in comment for get_f64_reg
Use bool value in set_cpu_online()
riscv: selftests: Add hwprobe binaries to .gitignore
riscv: stacktrace: fixed walk_stackframe()
ftrace: riscv: move from REGS to ARGS
riscv: do not select MODULE_SECTIONS by default
riscv: show help string for riscv-specific targets
riscv: make image compression configurable
riscv: cpufeature: Fix extension subset checking
riscv: cpufeature: Fix thead vector hwcap removal
riscv: rewrite __kernel_map_pages() to fix sleeping in invalid context
riscv: force PAGE_SIZE linear mapping if debug_pagealloc is enabled
riscv: Define TASK_SIZE_MAX for __access_ok()
riscv: Remove PGDIR_SIZE_L3 and TASK_SIZE_MIN
Commit c97bf62996 ("riscv: Fix text patching when IPI are used")
converted ftrace_make_nop() to use patch_insn_write() which does not
emit any icache flush relying entirely on __ftrace_modify_code() to do
that.
But we missed that ftrace_make_nop() was called very early directly when
converting mcount calls into nops (actually on riscv it converts 2B nops
emitted by the compiler into 4B nops).
This caused crashes on multiple HW as reported by Conor and Björn since
the booting core could have half-patched instructions in its icache
which would trigger an illegal instruction trap: fix this by emitting a
local flush icache when early patching nops.
Fixes: c97bf62996 ("riscv: Fix text patching when IPI are used")
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reported-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20240523115134.70380-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Bergmann which enables a number of additional build-time warnings. We
fixed all the fallout which we could find, there may still be a few
stragglers.
- Samuel Holland has developed the series "Unified cross-architecture
kernel-mode FPU API". This does a lot of consolidation of
per-architecture kernel-mode FPU usage and enables the use of newer AMD
GPUs on RISC-V.
- Tao Su has fixed some selftests build warnings in the series
"Selftests: Fix compilation warnings due to missing _GNU_SOURCE
definition".
- This pull also includes a nilfs2 fixup from Ryusuke Konishi.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZk6OSAAKCRDdBJ7gKXxA
jpTGAP9hQaZ+g7CO38hKQAtEI8rwcZJtvUAP84pZEGMjYMGLxQD/S8z1o7UHx61j
DUbnunbOkU/UcPx3Fs/gp4KcJARMEgs=
=EPi9
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more non-mm updates from Andrew Morton:
- A series ("kbuild: enable more warnings by default") from Arnd
Bergmann which enables a number of additional build-time warnings. We
fixed all the fallout which we could find, there may still be a few
stragglers.
- Samuel Holland has developed the series "Unified cross-architecture
kernel-mode FPU API". This does a lot of consolidation of
per-architecture kernel-mode FPU usage and enables the use of newer
AMD GPUs on RISC-V.
- Tao Su has fixed some selftests build warnings in the series
"Selftests: Fix compilation warnings due to missing _GNU_SOURCE
definition".
- This pull also includes a nilfs2 fixup from Ryusuke Konishi.
* tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
nilfs2: make block erasure safe in nilfs_finish_roll_forward()
selftests/harness: use 1024 in place of LINE_MAX
Revert "selftests/harness: remove use of LINE_MAX"
selftests/fpu: allow building on other architectures
selftests/fpu: move FP code to a separate translation unit
drm/amd/display: use ARCH_HAS_KERNEL_FPU_SUPPORT
drm/amd/display: only use hard-float, not altivec on powerpc
riscv: add support for kernel-mode FPU
x86: implement ARCH_HAS_KERNEL_FPU_SUPPORT
powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORT
LoongArch: implement ARCH_HAS_KERNEL_FPU_SUPPORT
lib/raid6: use CC_FLAGS_FPU for NEON CFLAGS
arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS
arm64: implement ARCH_HAS_KERNEL_FPU_SUPPORT
ARM: crypto: use CC_FLAGS_FPU for NEON CFLAGS
ARM: implement ARCH_HAS_KERNEL_FPU_SUPPORT
arch: add ARCH_HAS_KERNEL_FPU_SUPPORT
x86/fpu: fix asm/fpu/types.h include guard
kbuild: enable -Wcast-function-type-strict unconditionally
kbuild: enable -Wformat-truncation on clang
...
Charlie Jenkins <charlie@rivosinc.com> says:
This series contains two minor fixes for the extension parsing in
cpufeature.c.
Some T-Head boards without vector 1.0 support report "v" in the isa
string in their DT which will cause the kernel to run vector code. The
code to blacklist "v" from these boards was doing so by using
riscv_cached_mvendorid() which has not been populated at the time of
extension parsing. This fix instead greedily reads the mvendorid CSR of
the boot hart to determine if the cpu is from T-Head.
The other fix is for an incorrect indexing bug. riscv extensions
sometimes imply other extensions. When adding these "subset" extensions
to the hardware capabilities array, they need to be checked if they are
valid. The current code only checks if the extension that is including
other extensions is valid and not the subset extensions.
These patches were previously included in:
https://lore.kernel.org/lkml/20240420-dev-charlie-support_thead_vector_6_9-v3-0-67cff4271d1d@rivosinc.com/
* b4-shazam-merge:
riscv: cpufeature: Fix extension subset checking
riscv: cpufeature: Fix thead vector hwcap removal
Link: https://lore.kernel.org/r/20240502-cpufeature_fixes-v4-0-b3d1a088722d@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The declaration of set_cpu_online() takes a bool value. So replace
int here to make it consistent with the declaration.
Signed-off-by: Zhao Ke <ke.zhao@shingroup.cn>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240318065404.123668-1-ke.zhao@shingroup.cn
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
If the load access fault occures in a leaf function (with
CONFIG_FRAME_POINTER=y), when wrong stack trace will be displayed:
[<ffffffff804853c2>] regmap_mmio_read32le+0xe/0x1c
---[ end trace 0000000000000000 ]---
Registers dump:
ra 0xffffffff80485758 <regmap_mmio_read+36>
sp 0xffffffc80200b9a0
fp 0xffffffc80200b9b0
pc 0xffffffff804853ba <regmap_mmio_read32le+6>
Stack dump:
0xffffffc80200b9a0: 0xffffffc80200b9e0 0xffffffc80200b9e0
0xffffffc80200b9b0: 0xffffffff8116d7e8 0x0000000000000100
0xffffffc80200b9c0: 0xffffffd8055b9400 0xffffffd8055b9400
0xffffffc80200b9d0: 0xffffffc80200b9f0 0xffffffff8047c526
0xffffffc80200b9e0: 0xffffffc80200ba30 0xffffffff8047fe9a
The assembler dump of the function preambula:
add sp,sp,-16
sd s0,8(sp)
add s0,sp,16
In the fist stack frame, where ra is not stored on the stack we can
observe:
0(sp) 8(sp)
.---------------------------------------------.
sp->| frame->fp | frame->ra (saved fp) |
|---------------------------------------------|
fp->| .... | .... |
|---------------------------------------------|
| | |
and in the code check is performed:
if (regs && (regs->epc == pc) && (frame->fp & 0x7))
I see no reason to check frame->fp value at all, because it is can be
uninitialized value on the stack. A better way is to check frame->ra to
be an address on the stack. After the stacktrace shows as expect:
[<ffffffff804853c2>] regmap_mmio_read32le+0xe/0x1c
[<ffffffff80485758>] regmap_mmio_read+0x24/0x52
[<ffffffff8047c526>] _regmap_bus_reg_read+0x1a/0x22
[<ffffffff8047fe9a>] _regmap_read+0x5c/0xea
[<ffffffff80480376>] _regmap_update_bits+0x76/0xc0
...
---[ end trace 0000000000000000 ]---
As pointed by Samuel Holland it is incorrect to remove check of the stackframe
entirely.
Changes since v2 [2]:
- Add accidentally forgotten curly brace
Changes since v1 [1]:
- Instead of just dropping frame->fp check, replace it with validation of
frame->ra, which should be a stack address.
- Move frame pointer validation into the separate function.
[1] https://lore.kernel.org/linux-riscv/20240426072701.6463-1-dev.mbstr@gmail.com/
[2] https://lore.kernel.org/linux-riscv/20240521131314.48895-1-dev.mbstr@gmail.com/
Fixes: f766f77a74 ("riscv/stacktrace: Fix stack output without ra on the stack top")
Signed-off-by: Matthew Bystrin <dev.mbstr@gmail.com>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240521191727.62012-1-dev.mbstr@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This commit replaces riscv's support for FTRACE_WITH_REGS with support
for FTRACE_WITH_ARGS. This is required for the ongoing effort to stop
relying on stop_machine() for RISCV's implementation of ftrace.
The main relevant benefit that this change will bring for the above
use-case is that now we don't have separate ftrace_caller and
ftrace_regs_caller trampolines. This will allow the callsite to call
ftrace_caller by modifying a single instruction. Now the callsite can
do something similar to:
When not tracing: | When tracing:
func: func:
auipc t0, ftrace_caller_top auipc t0, ftrace_caller_top
nop <=========<Enable/Disable>=========> jalr t0, ftrace_caller_bottom
[...] [...]
The above assumes that we are dropping the support of calling a direct
trampoline from the callsite. We need to drop this as the callsite can't
change the target address to call, it can only enable/disable a call to
a preset target (ftrace_caller in the above diagram). We can later optimize
this by calling an intermediate dispatcher trampoline before ftrace_caller.
Currently, ftrace_regs_caller saves all CPU registers in the format of
struct pt_regs and allows the tracer to modify them. We don't need to
save all of the CPU registers because at function entry only a subset of
pt_regs is live:
|----------+----------+---------------------------------------------|
| Register | ABI Name | Description |
|----------+----------+---------------------------------------------|
| x1 | ra | Return address for traced function |
| x2 | sp | Stack pointer |
| x5 | t0 | Return address for ftrace_caller trampoline |
| x8 | s0/fp | Frame pointer |
| x10-11 | a0-1 | Function arguments/return values |
| x12-17 | a2-7 | Function arguments |
|----------+----------+---------------------------------------------|
See RISCV calling convention[1] for the above table.
Saving just the live registers decreases the amount of stack space
required from 288 Bytes to 112 Bytes.
Basic testing was done with this on the VisionFive 2 development board.
Note:
- Moving from REGS to ARGS will mean that RISCV will stop supporting
KPROBES_ON_FTRACE as it requires full pt_regs to be saved.
- KPROBES_ON_FTRACE will be supplanted by FPROBES see [2].
[1] https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
[2] https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20240405142453.4187-1-puranjay@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
* Support for byte/half-word compare-and-exchange, emulated via LR/SC
loops.
* Support for Rust.
* Support for Zihintpause in hwprobe.
* Support for the PR_RISCV_SET_ICACHE_FLUSH_CTX prctl().
* Support for lockless lockrefs.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmZN/hcTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiVrGEACUT3gsbTx1q7fa11iQNxOjVkpl66Qn
7+kI+V9xt5+GuH2EjJk6AsSNHPKeQ8totbSTA8AZjINFvgVjXslN+DPpcjCFKvnh
NN5/Lyd64X0PZMsxGWlN9SHTFWf2b7lalCnY51BlX/IpBbHWc/no9XUsPSVixx6u
9q+JoS3D1DDV92nGcA/UK9ICCsDcf4omWgZW7KbjnVWnuY9jt4ctTy11jtF2RM9R
Z9KAWh0RqPzjz0vNbBBf9Iw7E4jt/Px6HDYPfZAiE2dVsCTHjdsC7TcGRYXzKt6F
4q9zg8kzwvUG5GaBl7/XprXO1vaeOUmPcTVoE7qlRkSdkknRH/iBz1P4hk+r0fze
f+h5ZUV/oJP7vDb+vHm/BExtGufgLuJ2oMA2Bp9qI17EMcMsGiRMt7DsBMEafWDk
bNrFcJdqqYBz6HxfTwzNH5ErxfS/59PuwYl913BTSOH//raCZCFXOfyrSICH7qXd
UFOLLmBpMuApLa8ayFeI9Mp3flWfbdQHR52zLRLiUvlpWNEDKrNQN417juVwTXF0
DYkjJDhFPLfFOr/sJBboftOMOUdA9c/CJepY9o4kPvBXUvPtRHN1jdXDNSCVDZRb
nErnsJ9rv0PzfxQU7Xjhd2QmCMeMlbCQDpXAKKETyyimpTbgF33rovN0i5ixX3m4
KG6RvKDubOzZdA==
=YLoD
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.10-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- Add byte/half-word compare-and-exchange, emulated via LR/SC loops
- Support for Rust
- Support for Zihintpause in hwprobe
- Add PR_RISCV_SET_ICACHE_FLUSH_CTX prctl()
- Support lockless lockrefs
* tag 'riscv-for-linus-6.10-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (42 commits)
riscv: defconfig: Enable CONFIG_CLK_SOPHGO_CV1800
riscv: select ARCH_HAS_FAST_MULTIPLIER
riscv: mm: still create swiotlb buffer for kmalloc() bouncing if required
riscv: Annotate pgtable_l{4,5}_enabled with __ro_after_init
riscv: Remove redundant CONFIG_64BIT from pgtable_l{4,5}_enabled
riscv: mm: Always use an ASID to flush mm contexts
riscv: mm: Preserve global TLB entries when switching contexts
riscv: mm: Make asid_bits a local variable
riscv: mm: Use a fixed layout for the MM context ID
riscv: mm: Introduce cntx2asid/cntx2version helper macros
riscv: Avoid TLB flush loops when affected by SiFive CIP-1200
riscv: Apply SiFive CIP-1200 workaround to single-ASID sfence.vma
riscv: mm: Combine the SMP and UP TLB flush code
riscv: Only send remote fences when some other CPU is online
riscv: mm: Broadcast kernel TLB flushes only when needed
riscv: Use IPIs for remote cache/TLB flushes by default
riscv: Factor out page table TLB synchronization
riscv: Flush the instruction cache during SMP bringup
riscv: hwprobe: export Zihintpause ISA extension
riscv: misaligned: remove CONFIG_RISCV_M_MODE specific code
...
This loop is supposed to check if ext->subset_ext_ids[j] is valid, rather
than if ext->subset_ext_ids[i] is valid, before setting the extension
id ext->subset_ext_ids[j] in isainfo->isa.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Fixes: 0d8295ed97 ("riscv: add ISA extension parsing for scalar crypto")
Link: https://lore.kernel.org/r/20240502-cpufeature_fixes-v4-2-b3d1a088722d@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The riscv_cpuinfo struct that contains mvendorid and marchid is not
populated until all harts are booted which happens after the DT parsing.
Use the mvendorid/marchid from the boot hart to determine if the DT
contains an invalid V.
Fixes: d82f32202e ("RISC-V: Ignore V from the riscv,isa DT property on older T-Head CPUs")
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/r/20240502-cpufeature_fixes-v4-1-b3d1a088722d@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This is motivated by the amdgpu DRM driver, which needs floating-point
code to support recent hardware. That code is not performance-critical,
so only provide a minimal non-preemptible implementation for now.
Support is limited to riscv64 because riscv32 requires runtime (libgcc)
assistance to convert between doubles and 64-bit integers.
Link: https://lkml.kernel.org/r/20240329072441.591471-12-samuel.holland@sifive.com
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Xuerui <git@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
documented (hopefully adequately) in the respective changelogs. Notable
series include:
- Lucas Stach has provided some page-mapping
cleanup/consolidation/maintainability work in the series "mm/treewide:
Remove pXd_huge() API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being allocated:
number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in largely
similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
Weiner has fixed the page allocator's handling of migratetype requests,
with resulting improvements in compaction efficiency.
- In the series "make the hugetlb migration strategy consistent" Baolin
Wang has fixed a hugetlb migration issue, which should improve hugetlb
allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when memory
almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting" Kairui
Song has optimized pagecache insertion, yielding ~10% performance
improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various page->flags
cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
"mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs. This
is a simple first-cut implementation for now. The series is "support
multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in the
series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts in
the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes
the initialization code so that migration between different memory types
works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant driver
in the series "mm: follow_pte() improvements and acrn follow_pte()
fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to folio
in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size THP's
in the series "mm: add per-order mTHP alloc and swpout counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His series
"mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback instrumentation
in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series "Fix
and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
series "Improve anon_vma scalability for anon VMAs". Intel's test bot
reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
=V3R/
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:
- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.
- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""
* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...
- Avoid 'constexpr', which is a keyword in C23
- Allow 'dtbs_check' and 'dt_compatible_check' run independently of
'dt_binding_check'
- Fix weak references to avoid GOT entries in position-independent
code generation
- Convert the last use of 'optional' property in arch/sh/Kconfig
- Remove support for the 'optional' property in Kconfig
- Remove support for Clang's ThinLTO caching, which does not work with
the .incbin directive
- Change the semantics of $(src) so it always points to the source
directory, which fixes Makefile inconsistencies between upstream and
downstream
- Fix 'make tar-pkg' for RISC-V to produce a consistent package
- Provide reasonable default coverage for objtool, sanitizers, and
profilers
- Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
- Remove the last use of tristate choice in drivers/rapidio/Kconfig
- Various cleanups and fixes in Kconfig
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX
2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa
msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj
GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi
inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ
lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv
JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm
Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw
y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL
orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2
aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd
ZCSnZ31h7woGfNho
=D5B/
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Avoid 'constexpr', which is a keyword in C23
- Allow 'dtbs_check' and 'dt_compatible_check' run independently of
'dt_binding_check'
- Fix weak references to avoid GOT entries in position-independent code
generation
- Convert the last use of 'optional' property in arch/sh/Kconfig
- Remove support for the 'optional' property in Kconfig
- Remove support for Clang's ThinLTO caching, which does not work with
the .incbin directive
- Change the semantics of $(src) so it always points to the source
directory, which fixes Makefile inconsistencies between upstream and
downstream
- Fix 'make tar-pkg' for RISC-V to produce a consistent package
- Provide reasonable default coverage for objtool, sanitizers, and
profilers
- Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
- Remove the last use of tristate choice in drivers/rapidio/Kconfig
- Various cleanups and fixes in Kconfig
* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
kconfig: use sym_get_choice_menu() in sym_check_prop()
rapidio: remove choice for enumeration
kconfig: lxdialog: remove initialization with A_NORMAL
kconfig: m/nconf: merge two item_add_str() calls
kconfig: m/nconf: remove dead code to display value of bool choice
kconfig: m/nconf: remove dead code to display children of choice members
kconfig: gconf: show checkbox for choice correctly
kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
Makefile: remove redundant tool coverage variables
kbuild: provide reasonable defaults for tool coverage
modules: Drop the .export_symbol section from the final modules
kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
kconfig: use sym_get_choice_menu() in conf_write_defconfig()
kconfig: add sym_get_choice_menu() helper
kconfig: turn defaults and additional prompt for choice members into error
kconfig: turn missing prompt for choice members into error
kconfig: turn conf_choice() into void function
kconfig: use linked list in sym_set_changed()
kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
kconfig: gconf: remove debug code
...
- tracing/probes: Adding new pseudo-types %pd and %pD support for dumping
dentry name from 'struct dentry *' and file name from 'struct file *'.
- uprobes: Some performance optimizations have been done.
. Speed up the BPF uprobe event by delaying the fetching of the uprobe
event arguments that are not used in BPF.
. Avoid locking by speculatively checking whether uprobe event is valid.
. Reduce lock contention by using read/write_lock instead of spinlock for
uprobe list operation. This improved BPF uprobe benchmark result 43% on
average.
- rethook: Removes non-fatal warning messages when tracing stack from BPF
and skip rcu_is_watching() validation in rethook if possible.
- objpool: Optimizing objpool (which is used by kretprobes and fprobe as
rethook backend storage) by inlining functions and avoid caching nr_cpu_ids
because it is a const value.
- fprobe: Add entry/exit callbacks types (code cleanup)
- kprobes: Check ftrace was killed in kprobes if it uses ftrace.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmZFUxsbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b+fIH/A96/SeC5WRLhXmHfTCM
IvKUea2n0b0oV/2pVfHqfkCBTICuUZ97Opd9VH9jLtjBOTh0fUOGZ2DNVGdSYfWm
IIkS5dhuZxHXrSHEVYykwLHI3AOL7Q6Ny9EmOg1CNMidUkPMNtBvppsBYPlFU/B/
qQJAvOdkVOnNITCaas0+MNgepoVVKdJzdNQ1I4WrGyG8isCZBaCYKo2QcGyheCNN
y8NXvnVHgmgHQ8nTaeE5AawclFzFnhwHfPQPe1kiyGrx15b8K+VYmaZxPKv33A1a
KT3TKJ1Ep7s7iWFh2iPVJzIwOXCmSnvNTKfNx/MDuKtO7UVfFwytoMEaekbmv3bG
VqM=
=n/mW
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
- tracing/probes: Add new pseudo-types %pd and %pD support for dumping
dentry name from 'struct dentry *' and file name from 'struct file *'
- uprobes performance optimizations:
- Speed up the BPF uprobe event by delaying the fetching of the
uprobe event arguments that are not used in BPF
- Avoid locking by speculatively checking whether uprobe event is
valid
- Reduce lock contention by using read/write_lock instead of
spinlock for uprobe list operation. This improved BPF uprobe
benchmark result 43% on average
- rethook: Remove non-fatal warning messages when tracing stack from
BPF and skip rcu_is_watching() validation in rethook if possible
- objpool: Optimize objpool (which is used by kretprobes and fprobe as
rethook backend storage) by inlining functions and avoid caching
nr_cpu_ids because it is a const value
- fprobe: Add entry/exit callbacks types (code cleanup)
- kprobes: Check ftrace was killed in kprobes if it uses ftrace
* tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
kprobe/ftrace: bail out if ftrace was killed
selftests/ftrace: Fix required features for VFS type test case
objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
objpool: enable inlining objpool_push() and objpool_pop() operations
rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
ftrace: make extra rcu_is_watching() validation check optional
uprobes: reduce contention on uprobes_tree access
rethook: Remove warning messages printed for finding return address of a frame.
fprobe: Add entry/exit callbacks types
selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"
selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
Documentation: tracing: add new type '%pd' and '%pD' for kprobe
tracing/probes: support '%pD' type for print struct file's name
tracing/probes: support '%pd' type for print struct dentry's name
uprobes: add speculative lockless system-wide uprobe filter check
uprobes: prepare uprobe args buffer lazily
uprobes: encapsulate preparation of uprobe args buffer
If an error happens in ftrace, ftrace_kill() will prevent disarming
kprobes. Eventually, the ftrace_ops associated with the kprobes will be
freed, yet the kprobes will still be active, and when triggered, they
will use the freed memory, likely resulting in a page fault and panic.
This behavior can be reproduced quite easily, by creating a kprobe and
then triggering a ftrace_kill(). For simplicity, we can simulate an
ftrace error with a kernel module like [1]:
[1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer
sudo perf probe --add commit_creds
sudo perf trace -e probe:commit_creds
# In another terminal
make
sudo insmod ftrace_killer.ko # calls ftrace_kill(), simulating bug
# Back to perf terminal
# ctrl-c
sudo perf probe --del commit_creds
After a short period, a page fault and panic would occur as the kprobe
continues to execute and uses the freed ftrace_ops. While ftrace_kill()
is supposed to be used only in extreme circumstances, it is invoked in
FTRACE_WARN_ON() and so there are many places where an unexpected bug
could be triggered, yet the system may continue operating, possibly
without the administrator noticing. If ftrace_kill() does not panic the
system, then we should do everything we can to continue operating,
rather than leave a ticking time bomb.
Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Move a lot of state that was previously stored on a per vcpu
basis into a per-CPU area, because it is only pertinent to the
host while the vcpu is loaded. This results in better state
tracking, and a smaller vcpu structure.
* Add full handling of the ERET/ERETAA/ERETAB instructions in
nested virtualisation. The last two instructions also require
emulating part of the pointer authentication extension.
As a result, the trap handling of pointer authentication has
been greatly simplified.
* Turn the global (and not very scalable) LPI translation cache
into a per-ITS, scalable cache, making non directly injected
LPIs much cheaper to make visible to the vcpu.
* A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
* Allocate PPIs and SGIs outside of the vcpu structure, allowing
for smaller EL2 mapping and some flexibility in implementing
more or less than 32 private IRQs.
* Purge stale mpidr_data if a vcpu is created after the MPIDR
map has been created.
* Preserve vcpu-specific ID registers across a vcpu reset.
* Various minor cleanups and improvements.
LoongArch:
* Add ParaVirt IPI support.
* Add software breakpoint support.
* Add mmio trace events support.
RISC-V:
* Support guest breakpoints using ebreak
* Introduce per-VCPU mp_state_lock and reset_cntx_lock
* Virtualize SBI PMU snapshot and counter overflow interrupts
* New selftests for SBI PMU and Guest ebreak
* Some preparatory work for both TDX and SNP page fault handling.
This also cleans up the page fault path, so that the priorities
of various kinds of fauls (private page, no memory, write
to read-only slot, etc.) are easier to follow.
x86:
* Minimize amount of time that shadow PTEs remain in the special
REMOVED_SPTE state. This is a state where the mmu_lock is held for
reading but concurrent accesses to the PTE have to spin; shortening
its use allows other vCPUs to repopulate the zapped region while
the zapper finishes tearing down the old, defunct page tables.
* Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field,
which is defined by hardware but left for software use. This lets KVM
communicate its inability to map GPAs that set bits 51:48 on hosts
without 5-level nested page tables. Guest firmware is expected to
use the information when mapping BARs; this avoids that they end up at
a legal, but unmappable, GPA.
* Fixed a bug where KVM would not reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
* As usual, a bunch of code cleanups.
x86 (AMD):
* Implement a new and improved API to initialize SEV and SEV-ES VMs, which
will also be extendable to SEV-SNP. The new API specifies the desired
encryption in KVM_CREATE_VM and then separately initializes the VM.
The new API also allows customizing the desired set of VMSA features;
the features affect the measurement of the VM's initial state, and
therefore enabling them cannot be done tout court by the hypervisor.
While at it, the new API includes two bugfixes that couldn't be
applied to the old one without a flag day in userspace or without
affecting the initial measurement. When a SEV-ES VM is created with
the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are
rejected once the VMSA has been encrypted. Also, the FPU and AVX
state will be synchronized and encrypted too.
* Support for GHCB version 2 as applicable to SEV-ES guests. This, once
more, is only accessible when using the new KVM_SEV_INIT2 flow for
initialization of SEV-ES VMs.
x86 (Intel):
* An initial bunch of prerequisite patches for Intel TDX were merged.
They generally don't do anything interesting. The only somewhat user
visible change is a new debugging mode that checks that KVM's MMU
never triggers a #VE virtualization exception in the guest.
* Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to
L1, as per the SDM.
Generic:
* Use vfree() instead of kvfree() for allocations that always use vcalloc()
or __vcalloc().
* Remove .change_pte() MMU notifier - the changes to non-KVM code are
small and Andrew Morton asked that I also take those through the KVM
tree. The callback was only ever implemented by KVM (which was also the
original user of MMU notifiers) but it had been nonfunctional ever since
calls to set_pte_at_notify were wrapped with invalidate_range_start
and invalidate_range_end... in 2012.
Selftests:
* Enhance the demand paging test to allow for better reporting and stressing
of UFFD performance.
* Convert the steal time test to generate TAP-friendly output.
* Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed
time across two different clock domains.
* Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT.
* Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell
script, to play nice with running in a minimal userspace environment.
* Allow skipping the RSEQ test's sanity check that the vCPU was able to
complete a reasonable number of KVM_RUNs, as the assert can fail on a
completely valid setup. If the test is run on a large-ish system that is
otherwise idle, and the test isn't affined to a low-ish number of CPUs, the
vCPU task can be repeatedly migrated to CPUs that are in deep sleep states,
which results in the vCPU having very little net runtime before the next
migration due to high wakeup latencies.
* Define _GNU_SOURCE for all selftests to fix a warning that was introduced by
a change to kselftest_harness.h late in the 6.9 cycle, and because forcing
every test to #define _GNU_SOURCE is painful.
* Provide a global pseudo-RNG instance for all tests, so that library code can
generate random, but determinstic numbers.
* Use the global pRNG to randomly force emulation of select writes from guest
code on x86, e.g. to help validate KVM's emulation of locked accesses.
* Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception
handlers at VM creation, instead of forcing tests to manually trigger the
related setup.
Documentation:
* Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmZE878UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOukQf+LcvZsWtrC7Wd5K9SQbYXaS4Rk6P6
JHoQW2d0hUN893J2WibEw+l1J/0vn5JumqHXyZgJ7CbaMtXkWWQTwDSDLuURUKpv
XNB3Sb17G87NH+s1tOh0tA9h5upbtlHVHvrtIwdbb9+XHgQ6HTL4uk+HdfO/p9fW
cWBEZAKoWcCIa99Numv3pmq5vdrvBlNggwBugBS8TH69EKMw+V1Vu1SFkIdNDTQk
NJJ28cohoP3wnwlIHaXSmU4RujipPH3Lm/xupyA5MwmzO713eq2yUqV49jzhD5/I
MA4Ruvgrdm4wpp89N9lQMyci91u6q7R9iZfMu0tSg2qYI3UPKIdstd8sOA==
=2lED
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Move a lot of state that was previously stored on a per vcpu basis
into a per-CPU area, because it is only pertinent to the host while
the vcpu is loaded. This results in better state tracking, and a
smaller vcpu structure.
- Add full handling of the ERET/ERETAA/ERETAB instructions in nested
virtualisation. The last two instructions also require emulating
part of the pointer authentication extension. As a result, the trap
handling of pointer authentication has been greatly simplified.
- Turn the global (and not very scalable) LPI translation cache into
a per-ITS, scalable cache, making non directly injected LPIs much
cheaper to make visible to the vcpu.
- A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
- Allocate PPIs and SGIs outside of the vcpu structure, allowing for
smaller EL2 mapping and some flexibility in implementing more or
less than 32 private IRQs.
- Purge stale mpidr_data if a vcpu is created after the MPIDR map has
been created.
- Preserve vcpu-specific ID registers across a vcpu reset.
- Various minor cleanups and improvements.
LoongArch:
- Add ParaVirt IPI support
- Add software breakpoint support
- Add mmio trace events support
RISC-V:
- Support guest breakpoints using ebreak
- Introduce per-VCPU mp_state_lock and reset_cntx_lock
- Virtualize SBI PMU snapshot and counter overflow interrupts
- New selftests for SBI PMU and Guest ebreak
- Some preparatory work for both TDX and SNP page fault handling.
This also cleans up the page fault path, so that the priorities of
various kinds of fauls (private page, no memory, write to read-only
slot, etc.) are easier to follow.
x86:
- Minimize amount of time that shadow PTEs remain in the special
REMOVED_SPTE state.
This is a state where the mmu_lock is held for reading but
concurrent accesses to the PTE have to spin; shortening its use
allows other vCPUs to repopulate the zapped region while the zapper
finishes tearing down the old, defunct page tables.
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
field, which is defined by hardware but left for software use.
This lets KVM communicate its inability to map GPAs that set bits
51:48 on hosts without 5-level nested page tables. Guest firmware
is expected to use the information when mapping BARs; this avoids
that they end up at a legal, but unmappable, GPA.
- Fixed a bug where KVM would not reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- As usual, a bunch of code cleanups.
x86 (AMD):
- Implement a new and improved API to initialize SEV and SEV-ES VMs,
which will also be extendable to SEV-SNP.
The new API specifies the desired encryption in KVM_CREATE_VM and
then separately initializes the VM. The new API also allows
customizing the desired set of VMSA features; the features affect
the measurement of the VM's initial state, and therefore enabling
them cannot be done tout court by the hypervisor.
While at it, the new API includes two bugfixes that couldn't be
applied to the old one without a flag day in userspace or without
affecting the initial measurement. When a SEV-ES VM is created with
the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
once the VMSA has been encrypted. Also, the FPU and AVX state will
be synchronized and encrypted too.
- Support for GHCB version 2 as applicable to SEV-ES guests.
This, once more, is only accessible when using the new
KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.
x86 (Intel):
- An initial bunch of prerequisite patches for Intel TDX were merged.
They generally don't do anything interesting. The only somewhat
user visible change is a new debugging mode that checks that KVM's
MMU never triggers a #VE virtualization exception in the guest.
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
VM-Exit to L1, as per the SDM.
Generic:
- Use vfree() instead of kvfree() for allocations that always use
vcalloc() or __vcalloc().
- Remove .change_pte() MMU notifier - the changes to non-KVM code are
small and Andrew Morton asked that I also take those through the
KVM tree.
The callback was only ever implemented by KVM (which was also the
original user of MMU notifiers) but it had been nonfunctional ever
since calls to set_pte_at_notify were wrapped with
invalidate_range_start and invalidate_range_end... in 2012.
Selftests:
- Enhance the demand paging test to allow for better reporting and
stressing of UFFD performance.
- Convert the steal time test to generate TAP-friendly output.
- Fix a flaky false positive in the xen_shinfo_test due to comparing
elapsed time across two different clock domains.
- Skip the MONITOR/MWAIT test if the host doesn't actually support
MWAIT.
- Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
shell script, to play nice with running in a minimal userspace
environment.
- Allow skipping the RSEQ test's sanity check that the vCPU was able
to complete a reasonable number of KVM_RUNs, as the assert can fail
on a completely valid setup.
If the test is run on a large-ish system that is otherwise idle,
and the test isn't affined to a low-ish number of CPUs, the vCPU
task can be repeatedly migrated to CPUs that are in deep sleep
states, which results in the vCPU having very little net runtime
before the next migration due to high wakeup latencies.
- Define _GNU_SOURCE for all selftests to fix a warning that was
introduced by a change to kselftest_harness.h late in the 6.9
cycle, and because forcing every test to #define _GNU_SOURCE is
painful.
- Provide a global pseudo-RNG instance for all tests, so that library
code can generate random, but determinstic numbers.
- Use the global pRNG to randomly force emulation of select writes
from guest code on x86, e.g. to help validate KVM's emulation of
locked accesses.
- Allocate and initialize x86's GDT, IDT, TSS, segments, and default
exception handlers at VM creation, instead of forcing tests to
manually trigger the related setup.
Documentation:
- Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
selftests/kvm: remove dead file
KVM: selftests: arm64: Test vCPU-scoped feature ID registers
KVM: selftests: arm64: Test that feature ID regs survive a reset
KVM: selftests: arm64: Store expected register value in set_id_regs
KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
KVM: arm64: Only reset vCPU-scoped feature ID regs once
KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
KVM: arm64: Rename is_id_reg() to imply VM scope
KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
KVM: arm64: Fix hvhe/nvhe early alias parsing
KVM: SEV: Allow per-guest configuration of GHCB protocol version
KVM: SEV: Add GHCB handling for termination requests
KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
KVM: SEV: Add support to handle AP reset MSR protocol
KVM: x86: Explicitly zero kvm_caps during vendor module load
KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
KVM: x86: Fully re-initialize supported_vm_types on vendor module load
KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
...
Now Kbuild provides reasonable defaults for objtool, sanitizers, and
profilers.
Remove redundant variables.
Note:
This commit changes the coverage for some objects:
- include arch/mips/vdso/vdso-image.o into UBSAN, GCOV, KCOV
- include arch/sparc/vdso/vdso-image-*.o into UBSAN
- include arch/sparc/vdso/vma.o into UBSAN
- include arch/x86/entry/vdso/extable.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vdso-image-*.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vdso32-setup.o into KASAN, KCSAN, UBSAN, GCOV, KCOV
- include arch/x86/entry/vdso/vma.o into GCOV, KCOV
- include arch/x86/um/vdso/vma.o into KASAN, GCOV, KCOV
I believe these are positive effects because all of them are kernel
space objects.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Roberto Sassu <roberto.sassu@huawei.com>
execmem does not depend on modules, on the contrary modules use
execmem.
To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
The memory allocations for kprobes and BPF on RISC-V are not placed in
the modules area and these custom allocations are implemented with
overrides of alloc_insn_page() and bpf_jit_alloc_exec().
Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END for
32 bit and slightly reorder execmem_params initialization to support both
32 and 64 bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in
riscv::execmem_params and drop overrides of alloc_insn_page() and
bpf_jit_alloc_exec().
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Several architectures override module_alloc() only to define address
range for code allocations different than VMALLOC address space.
Provide a generic implementation in execmem that uses the parameters for
address space ranges, required alignment and page protections provided
by architectures.
The architectures must fill execmem_info structure and implement
execmem_arch_setup() that returns a pointer to that structure. This way the
execmem initialization won't be called from every architecture, but rather
from a central place, namely a core_initcall() in execmem.
The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
with the parameters defined by the architectures. If an architecture does
not implement execmem_arch_setup(), execmem_alloc() will fall back to
module_alloc().
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
checked-in source files. It is merely a convention without any functional
difference. In fact, $(obj) and $(src) are exactly the same, as defined
in scripts/Makefile.build:
src := $(obj)
When the kernel is built in a separate output directory, $(src) does
not accurately reflect the source directory location. While Kbuild
resolves this discrepancy by specifying VPATH=$(srctree) to search for
source files, it does not cover all cases. For example, when adding a
header search path for local headers, -I$(srctree)/$(src) is typically
passed to the compiler.
This introduces inconsistency between upstream and downstream Makefiles
because $(src) is used instead of $(srctree)/$(src) for the latter.
To address this inconsistency, this commit changes the semantics of
$(src) so that it always points to the directory in the source tree.
Going forward, the variables used in Makefiles will have the following
meanings:
$(obj) - directory in the object tree
$(src) - directory in the source tree (changed by this commit)
$(objtree) - the top of the kernel object tree
$(srctree) - the top of the kernel source tree
Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced
with $(src).
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Samuel Holland <samuel.holland@sifive.com> says:
This series converts uniprocessor kernel builds to use the same TLB
flushing code as SMP builds, to take advantage of batching and existing
range- and ASID-based TLB flush optimizations. It optimizes out IPIs and
SBI calls based on the online CPU count, which also covers the scenario
where SMP was enabled at build time but only one CPU is present/online.
A final optimization is to use single-ASID flushes wherever possible, to
avoid unnecessary TLB misses for kernel mappings.
This series has a semantic conflict with the AIA patches that are in
linux-next due to the removal of the third parameter of
riscv_ipi_set_virq_range(), which is called from imsic_ipi_domain_init()
in drivers/irqchip/irq-riscv-imsic-early.c. The resolution is to remove
the extra argument from the call site.
Here are some numbers from D1 which show the performance impact:
v6.9-rc1:
System Benchmarks Partial Index BASELINE RESULT INDEX
Execl Throughput 43.0 198.5 46.2
File Copy 1024 bufsize 2000 maxblocks 3960.0 73934.4 186.7
File Copy 256 bufsize 500 maxblocks 1655.0 20242.6 122.3
File Copy 4096 bufsize 8000 maxblocks 5800.0 197706.4 340.9
Pipe Throughput 12440.0 176974.2 142.3
Pipe-based Context Switching 4000.0 23626.8 59.1
Process Creation 126.0 449.9 35.7
Shell Scripts (1 concurrent) 42.4 544.4 128.4
Shell Scripts (16 concurrent) --- 35.3 ---
Shell Scripts (8 concurrent) 6.0 71.6 119.3
System Call Overhead 15000.0 248072.6 165.4
========
System Benchmarks Index Score (Partial Only) 110.6
v6.9-rc1 + this patch series:
System Benchmarks Partial Index BASELINE RESULT INDEX
Execl Throughput 43.0 196.8 45.8
File Copy 1024 bufsize 2000 maxblocks 3960.0 71782.2 181.3
File Copy 256 bufsize 500 maxblocks 1655.0 21269.4 128.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 199424.0 343.8
Pipe Throughput 12440.0 196468.6 157.9
Pipe-based Context Switching 4000.0 24261.8 60.7
Process Creation 126.0 459.0 36.4
Shell Scripts (1 concurrent) 42.4 543.8 128.2
Shell Scripts (16 concurrent) --- 35.5 ---
Shell Scripts (8 concurrent) 6.0 71.7 119.6
System Call Overhead 15000.0 259415.2 172.9
========
System Benchmarks Index Score (Partial Only) 113.0
* b4-shazam-lts:
riscv: mm: Always use an ASID to flush mm contexts
riscv: mm: Preserve global TLB entries when switching contexts
riscv: mm: Make asid_bits a local variable
riscv: mm: Use a fixed layout for the MM context ID
riscv: mm: Introduce cntx2asid/cntx2version helper macros
riscv: Avoid TLB flush loops when affected by SiFive CIP-1200
riscv: Apply SiFive CIP-1200 workaround to single-ASID sfence.vma
riscv: mm: Combine the SMP and UP TLB flush code
riscv: Only send remote fences when some other CPU is online
riscv: mm: Broadcast kernel TLB flushes only when needed
riscv: Use IPIs for remote cache/TLB flushes by default
riscv: Factor out page table TLB synchronization
riscv: Flush the instruction cache during SMP bringup
Link: https://lore.kernel.org/r/20240327045035.368512-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Alexandre Ghiti <alexghiti@rivosinc.com> says:
patch 1 removes a useless memory barrier and patch 2 actually fixes the
issue with IPI in the patching code.
* b4-shazam-merge:
riscv: Fix text patching when IPI are used
riscv: Remove superfluous smp_mb()
Link: https://lore.kernel.org/r/20240229121056.203419-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
An IPI backend is always required in an SMP configuration, but an SBI
implementation is not. For example, SBI will be unavailable when the
kernel runs in M mode. For this reason, consider IPI delivery of cache
and TLB flushes to be the base case, and any other implementation (such
as the SBI remote fence extension) to be an optimization.
Generally, if IPIs can be delivered without firmware assistance, they
are assumed to be faster than SBI calls due to the SBI context switch
overhead. However, when SBI is used as the IPI backend, then the context
switch cost must be paid anyway, and performing the cache/TLB flush
directly in the SBI implementation is more efficient than injecting an
interrupt to S-mode. This is the only existing scenario where
riscv_ipi_set_virq_range() is called with use_for_rfence set to false.
sbi_ipi_init() already checks riscv_ipi_have_virq_range(), so it only
calls riscv_ipi_set_virq_range() when no other IPI device is available.
This allows moving the static key and dropping the use_for_rfence
parameter. This decouples the static key from the irqchip driver probe
order.
Furthermore, the static branch only makes sense when CONFIG_RISCV_SBI is
enabled. Optherwise, IPIs must be used. Add a fallback definition of
riscv_use_sbi_for_rfence() which handles this case and removes the need
to check CONFIG_RISCV_SBI elsewhere, such as in cacheflush.c.
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240327045035.368512-4-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Instruction cache flush IPIs are sent only to CPUs in cpu_online_mask,
so they will not target a CPU until it calls set_cpu_online() earlier in
smp_callin(). As a result, if instruction memory is modified between the
CPU coming out of reset and that point, then its instruction cache may
contain stale data. Therefore, the instruction cache must be flushed
after the set_cpu_online() synchronization point.
Fixes: 08f051eda3 ("RISC-V: Flush I$ when making a dirty page executable")
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240327045035.368512-2-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Export the Zihintpause ISA extension through hwprobe which allows using
"pause" instructions. Some userspace applications (OpenJDK for
instance) uses this to handle some locking back-off.
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240221083108.1235311-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
While reworking code to fix sparse errors, it appears that the
RISCV_M_MODE specific could actually be removed and use the one for
normal mode. Even though RISCV_M_MODE can do direct user memory access,
using the user uaccess helpers is also going to work. Since there is no
need anymore for specific accessors (load_u8()/store_u8()), we can
directly use memcpy()/copy_{to/from}_user() and get rid of the copy
loop entirely. __read_insn() is also fixed to use an unsigned long
instead of a pointer which was cast in __user address space. The
insn_addr parameter is now cast from unsigned lnog to the correct
address space directly.
Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240206154104.896809-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
While the processor is executing kernel code, the value of the scratch
CSR is always zero, so there is no need to save the value. Continue to
write the CSR during the resume flow, so we do not rely on firmware to
initialize it.
Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20240312195641.1830521-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
SBI_STA_SHMEM_DISABLE is a macro to invoke disable shared memory
commands. As this can be invoked from other SBI extension context
as well, rename it to more generic name as SBI_SHMEM_DISABLE.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-7-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
For now, we use stop_machine() to patch the text and when we use IPIs for
remote icache flushes (which is emitted in patch_text_nosync()), the system
hangs.
So instead, make sure every CPU executes the stop_machine() patching
function and emit a local icache flush there.
Co-developed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20240229121056.203419-3-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
This memory barrier is not needed and not documented so simply remove
it.
Suggested-by: Andrea Parri <andrea@rivosinc.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20240229121056.203419-2-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
childregs represents the registers which are active for the new thread
in user context. For a kernel thread, childregs->gp is never used since
the kernel gp is not touched by switch_to. For a user mode helper, the
gp value can be observed in user space after execve or possibly by other
means.
[From the email thread]
The /* Kernel thread */ comment is somewhat inaccurate in that it is also used
for user_mode_helper threads, which exec a user process, e.g. /sbin/init or
when /proc/sys/kernel/core_pattern is a pipe. Such threads do not have
PF_KTHREAD set and are valid targets for ptrace etc. even before they exec.
childregs is the *user* context during syscall execution and it is observable
from userspace in at least five ways:
1. kernel_execve does not currently clear integer registers, so the starting
register state for PID 1 and other user processes started by the kernel has
sp = user stack, gp = kernel __global_pointer$, all other integer registers
zeroed by the memset in the patch comment.
This is a bug in its own right, but I'm unwilling to bet that it is the only
way to exploit the issue addressed by this patch.
2. ptrace(PTRACE_GETREGSET): you can PTRACE_ATTACH to a user_mode_helper thread
before it execs, but ptrace requires SIGSTOP to be delivered which can only
happen at user/kernel boundaries.
3. /proc/*/task/*/syscall: this is perfectly happy to read pt_regs for
user_mode_helpers before the exec completes, but gp is not one of the
registers it returns.
4. PERF_SAMPLE_REGS_USER: LOCKDOWN_PERF normally prevents access to kernel
addresses via PERF_SAMPLE_REGS_INTR, but due to this bug kernel addresses
are also exposed via PERF_SAMPLE_REGS_USER which is permitted under
LOCKDOWN_PERF. I have not attempted to write exploit code.
5. Much of the tracing infrastructure allows access to user registers. I have
not attempted to determine which forms of tracing allow access to user
registers without already allowing access to kernel registers.
Fixes: 7db91e57a0 ("RISC-V: Task implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Stefan O'Rear <sorear@fastmail.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240327061258.2370291-1-sorear@fastmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
patch_map() uses fixmap mappings to circumvent the non-writability of
the kernel text mapping.
The __set_fixmap() function only flushes the current cpu tlb, it does
not emit an IPI so we must make sure that while we use a fixmap mapping,
the current task is not migrated on another cpu which could miss the
newly introduced fixmap mapping.
So in order to avoid any task migration, disable the preemption.
Reported-by: Andrea Parri <andrea@rivosinc.com>
Closes: https://lore.kernel.org/all/ZcS+GAaM25LXsBOl@andrea/
Reported-by: Andy Chiu <andy.chiu@sifive.com>
Closes: https://lore.kernel.org/linux-riscv/CABgGipUMz3Sffu-CkmeUB1dKVwVQ73+7=sgC45-m0AE9RCjOZg@mail.gmail.com/
Fixes: cad539baa4 ("riscv: implement a memset like function for text")
Fixes: 0ff7c3b331 ("riscv: Use text_mutex instead of patch_lock")
Co-developed-by: Andy Chiu <andy.chiu@sifive.com>
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Acked-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240326203017.310422-3-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Print the instruction dump with info instead of emergency level. The
unhandled signal message is only for informational purpose.
Fixes: b8a03a6341 ("riscv: add userland instruction dump to RISC-V splats")
Signed-off-by: Andreas Schwab <schwab@suse.de>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/mvmy1aegrhm.fsf@suse.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The RISC-V Vector specification states in "Appendix D: Calling
Convention for Vector State" [1] that "Executing a system call causes
all caller-saved vector registers (v0-v31, vl, vtype) and vstart to
become unspecified.". In the RISC-V kernel this is called "discarding
the vstate".
Returning from a signal handler via the rt_sigreturn() syscall, vector
discard is also performed. However, this is not an issue since the
vector state should be restored from the sigcontext, and therefore not
care about the vector discard.
The "live state" is the actual vector register in the running context,
and the "vstate" is the vector state of the task. A dirty live state,
means that the vstate and live state are not in synch.
When vectorized user_from_copy() was introduced, an bug sneaked in at
the restoration code, related to the discard of the live state.
An example when this go wrong:
1. A userland application is executing vector code
2. The application receives a signal, and the signal handler is
entered.
3. The application returns from the signal handler, using the
rt_sigreturn() syscall.
4. The live vector state is discarded upon entering the
rt_sigreturn(), and the live state is marked as "dirty", indicating
that the live state need to be synchronized with the current
vstate.
5. rt_sigreturn() restores the vstate, except the Vector registers,
from the sigcontext
6. rt_sigreturn() restores the Vector registers, from the sigcontext,
and now the vectorized user_from_copy() is used. The dirty live
state from the discard is saved to the vstate, making the vstate
corrupt.
7. rt_sigreturn() returns to the application, which crashes due to
corrupted vstate.
Note that the vectorized user_from_copy() is invoked depending on the
value of CONFIG_RISCV_ISA_V_UCOPY_THRESHOLD. Default is 768, which
means that vlen has to be larger than 128b for this bug to trigger.
The fix is simply to mark the live state as non-dirty/clean prior
performing the vstate restore.
Link: https://github.com/riscv/riscv-isa-manual/releases/download/riscv-isa-release-8abdb41-2024-03-26/unpriv-isa-asciidoc.pdf # [1]
Reported-by: Charlie Jenkins <charlie@rivosinc.com>
Reported-by: Vineet Gupta <vgupta@kernel.org>
Fixes: c2a658d419 ("riscv: lib: vectorize copy_to_user/copy_from_user")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Andy Chiu <andy.chiu@sifive.com>
Tested-by: Vineet Gupta <vineetg@rivosinc.com>
Link: https://lore.kernel.org/r/20240403072638.567446-1-bjorn@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
arch/riscv/kernel/sys_riscv.c builds without using anything
from asm-generic/mman-common.h. There seems to be no reason
to include this file.
Signed-off-by: Tanzir Hasan <tanzirh@google.com>
Fixes: 9e2e6042a7 ("riscv: Allow PROT_WRITE-only mmap()")
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20231212-removeasmgeneric-v2-1-a0e6d7df34a7@google.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Such relocation causes crash of android linker similar to one
described in commit e05d57dcb8
("riscv: Fixup __vdso_gettimeofday broke dynamic ftrace").
Looks like this relocation is added by CONFIG_DYNAMIC_FTRACE which is
disabled in the default android kernel.
Before:
readelf -rW arch/riscv/kernel/vdso/vdso.so:
Relocation section '.rela.dyn' at offset 0xd00 contains 1 entry:
Offset Info Type
0000000000000d20 0000000000000003 R_RISCV_RELATIVE
objdump:
0000000000000c86 <__vdso_riscv_hwprobe@@LINUX_4.15>:
c86: 0001 nop
c88: 0001 nop
c8a: 0001 nop
c8c: 0001 nop
c8e: e211 bnez a2,c92 <__vdso_riscv_hwprobe...
After:
readelf -rW arch/riscv/kernel/vdso/vdso.so:
There are no relocations in this file.
objdump:
0000000000000c86 <__vdso_riscv_hwprobe@@LINUX_4.15>:
c86: e211 bnez a2,c8a <__vdso_riscv_hwprobe...
c88: c6b9 beqz a3,cd6 <__vdso_riscv_hwprobe...
c8a: e739 bnez a4,cd8 <__vdso_riscv_hwprobe...
c8c: ffffd797 auipc a5,0xffffd
Also disable SCS since it also should not be available in vdso.
Fixes: aa5af0aa90 ("RISC-V: Add hwprobe vDSO function and data")
Signed-off-by: Roman Artemev <roman.artemev@syntacore.com>
Signed-off-by: Vladimir Isaev <vladimir.isaev@syntacore.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/r/20240313085843.17661-1-vladimir.isaev@syntacore.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
* Support for various vector-accelerated crypto routines.
* Hibernation is now enabled for portable kernel builds.
* mmap_rnd_bits_max is larger on systems with larger VAs.
* Support for fast GUP.
* Support for membarrier-based instruction cache synchronization.
* Support for the Andes hart-level interrupt controller and PMU.
* Some cleanups around unaligned access speed probing and Kconfig
settings.
* Support for ACPI LPI and CPPC.
* Various cleanus related to barriers.
* A handful of fixes.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmX9icgTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYib+UD/4xyL6UMixx6A06BVBL9UT4vOrxRvNr
JIihG5y5QNMjes9DHWL35mZTMqFtQ0tq94ViWFLmJWloV/8KRVM2C9R9KX7vplf3
M/OwvP106spxgvNHoeQbycgs42RU1t2mpqT7N1iK2hCjqieP3vLn6hsSLXWTAG0L
3gQbQw6XCLC3hPyLq+nbFY2i4faeCmpXWmixoy/IvQ5calZQrRU0LNlP6lcMBhVo
uocjG0uGAhrahw2s81jxcMZcxa3AvUCiplapdD5H5v9rBM85SkYJj2Q9SqdSorkb
xzuimRnKPI5s47yM3pTfZY0qnQUYHV7PXXuw4WujpCQVQdhaG+Ggq63UUZA61J9t
IzZK2zdcfHqICrGTtXImUzRT3dcc3oq+IFq4tTY+rEJm29hrXkAtx+qBm5xtMvax
fJz5feJ/iT0u7MDj4Oq24n+Kpl+Olm+MJaZX3m5Ovi/9V6a9iK9HXqxg9/Fs0fMO
+J/0kTgd8Vu9CYH7KNWz3uztcO9eMAH3VyzuXuab4BGj1i1Y/9EjpALQi7rDN73S
OsYQX6NnzMkBV4dvElJVLXiPlvNlMHZZwdak5CqPb48jaJu6iiIZAuvOrG6/naGP
wnQSLVA2WWWoOkl3AJhxfpa11CLhbMl9E2gYm1VtNvASXoSFIxlAq1Yv3sG8yjty
4ZT0rYFJOstYiQ==
=3dL5
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- Support for various vector-accelerated crypto routines
- Hibernation is now enabled for portable kernel builds
- mmap_rnd_bits_max is larger on systems with larger VAs
- Support for fast GUP
- Support for membarrier-based instruction cache synchronization
- Support for the Andes hart-level interrupt controller and PMU
- Some cleanups around unaligned access speed probing and Kconfig
settings
- Support for ACPI LPI and CPPC
- Various cleanus related to barriers
- A handful of fixes
* tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits)
riscv: Fix syscall wrapper for >word-size arguments
crypto: riscv - add vector crypto accelerated AES-CBC-CTS
crypto: riscv - parallelize AES-CBC decryption
riscv: Only flush the mm icache when setting an exec pte
riscv: Use kcalloc() instead of kzalloc()
riscv/barrier: Add missing space after ','
riscv/barrier: Consolidate fence definitions
riscv/barrier: Define RISCV_FULL_BARRIER
riscv/barrier: Define __{mb,rmb,wmb}
RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ
cpufreq: Move CPPC configs to common Kconfig and add RISC-V
ACPI: RISC-V: Add CPPC driver
ACPI: Enable ACPI_PROCESSOR for RISC-V
ACPI: RISC-V: Add LPI driver
cpuidle: RISC-V: Move few functions to arch/riscv
riscv: Introduce set_compat_task() in asm/compat.h
riscv: Introduce is_compat_thread() into compat.h
riscv: add compile-time test into is_compat_task()
riscv: Replace direct thread flag check with is_compat_task()
riscv: Improve arch_get_mmap_end() macro
...
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)
- Use more threads when building Debian packages in parallel
- Fix warnings shown during the RPM kernel package uninstallation
- Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
Makefile
- Support GCC's -fmin-function-alignment flag
- Fix a null pointer dereference bug in modpost
- Add the DTB support to the RPM package
- Various fixes and cleanups in Kconfig
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM
VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC
7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz
vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG
ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP
OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff
LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p
+XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ
FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ
c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0
IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V
dWH7BPecS44h4KXx
=tFdl
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)
- Use more threads when building Debian packages in parallel
- Fix warnings shown during the RPM kernel package uninstallation
- Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
Makefile
- Support GCC's -fmin-function-alignment flag
- Fix a null pointer dereference bug in modpost
- Add the DTB support to the RPM package
- Various fixes and cleanups in Kconfig
* tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits)
kconfig: tests: test dependency after shuffling choices
kconfig: tests: add a test for randconfig with dependent choices
kconfig: tests: support KCONFIG_SEED for the randconfig runner
kbuild: rpm-pkg: add dtb files in kernel rpm
kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig()
kconfig: check prompt for choice while parsing
kconfig: lxdialog: remove unused dialog colors
kconfig: lxdialog: fix button color for blackbg theme
modpost: fix null pointer dereference
kbuild: remove GCC's default -Wpacked-bitfield-compat flag
kbuild: unexport abs_srctree and abs_objtree
kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1
kconfig: remove named choice support
kconfig: use linked list in get_symbol_str() to iterate over menus
kconfig: link menus to a symbol
kbuild: fix inconsistent indentation in top Makefile
kbuild: Use -fmin-function-alignment when available
alpha: merge two entries for CONFIG_ALPHA_GAMMA
alpha: merge two entries for CONFIG_ALPHA_EV4
kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj)
...
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.
So, use the purpose specific kcalloc() function instead of the argument
count * size in the kzalloc() function.
Also, it is preferred to use sizeof(*pointer) instead of sizeof(type)
due to the type of the variable can change and one needs not change the
former (unlike the latter).
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/162
Signed-off-by: Erick Archer <erick.archer@gmx.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20240120135400.4710-1-erick.archer@gmx.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Leonardo Bras <leobras@redhat.com> says:
I just saw the opportunity of optimizing the helper is_compat_task() by
introducing a compile-time test, and it made possible to remove some
#ifdef's without any loss of performance.
I also saw the possibility of removing the direct check of task flags from
general code, and concentrated it in asm/compat.h by creating a few more
helpers, which in the end helped optimize code.
arch_get_mmap_end() just got a simple improvement and some extra docs.
* b4-shazam-merge:
riscv: Introduce set_compat_task() in asm/compat.h
riscv: Introduce is_compat_thread() into compat.h
riscv: add compile-time test into is_compat_task()
riscv: Replace direct thread flag check with is_compat_task()
riscv: Improve arch_get_mmap_end() macro
Link: https://lore.kernel.org/r/20240103160024.70305-2-leobras@redhat.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
To support ACPI Low Power Idle (LPI), few functions are required which
are currently static functions in the DT based cpuidle driver. Hence,
move them under arch/riscv so that ACPI driver also can use them. Since
they are no longer static functions, append "riscv_" prefix to the
function name.
Signed-off-by: Sunil V L <sunilvl@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20240118062930.245937-2-sunilvl@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
task_user_regset_view() makes use of a function very similar to
is_compat_task(), but pointing to a any thread.
In arm64 asm/compat.h there is a function very similar to that:
is_compat_thread(struct thread_info *thread)
Copy this function to riscv asm/compat.h and make use of it into
task_user_regset_view().
Also, introduce a compile-time test for CONFIG_COMPAT and simplify the
function code by removing the #ifdef.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Andy Chiu <andy.chiu@sifive.com>
Link: https://lore.kernel.org/r/20240103160024.70305-6-leobras@redhat.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Charlie Jenkins <charlie@rivosinc.com> says:
If the hardware unaligned access speed is known at compile time, it is
possible to avoid running the unaligned access speed probe to speedup
boot-time.
* b4-shazam-merge:
riscv: Set unaligned access speed at compile time
riscv: Decouple emulated unaligned accesses from access speed
riscv: Only check online cpus for emulated accesses
riscv: lib: Introduce has_fast_unaligned_access()
Link: https://lore.kernel.org/r/20240308-disable_misaligned_probe_config-v9-0-a388770ba0ce@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Yu Chien Peter Lin <peterlin@andestech.com> says:
This patch series introduces the Andes PMU extension, which serves the
same purpose as Sscofpmf and Smcntrpmf. Its non-standard local interrupt
is assigned to bit 18 in the custom S-mode local interrupt enable and
pending registers (slie/slip), while the interrupt cause is (256 + 18).
* b4-shazam-merge:
riscv: andes: Support specifying symbolic firmware and hardware raw events
riscv: dts: renesas: Add Andes PMU extension for r9a07g043f
dt-bindings: riscv: Add Andes PMU extension description
perf: RISC-V: Introduce Andes PMU to support perf event sampling
perf: RISC-V: Eliminate redundant interrupt enable/disable operations
riscv: dts: renesas: r9a07g043f: Update compatible string to use Andes INTC
dt-bindings: riscv: Add Andes interrupt controller compatible string
riscv: errata: Rename defines for Andes
Link: https://lore.kernel.org/r/20240222083946.3977135-1-peterlin@andestech.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
heap optimizations".
- Kuan-Wei Chiu has also sped up the library sorting code in the series
"lib/sort: Optimize the number of swaps and comparisons".
- Alexey Gladkov has added the ability for code running within an IPC
namespace to alter its IPC and MQ limits. The series is "Allow to
change ipc/mq sysctls inside ipc namespace".
- Geert Uytterhoeven has contributed some dhrystone maintenance work in
the series "lib: dhry: miscellaneous cleanups".
- Ryusuke Konishi continues nilfs2 maintenance work in the series
"nilfs2: eliminate kmap and kmap_atomic calls"
"nilfs2: fix kernel bug at submit_bh_wbc()"
- Nathan Chancellor has updated our build tools requirements in the
series "Bump the minimum supported version of LLVM to 13.0.1".
- Muhammad Usama Anjum continues with the selftests maintenance work in
the series "selftests/mm: Improve run_vmtests.sh".
- Oleg Nesterov has done some maintenance work against the signal code
in the series "get_signal: minor cleanups and fix".
Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA
jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh
tKVmvihFxfAhdDthseXcIf1nBjMALwY=
=8rVc
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations".
- Kuan-Wei Chiu has also sped up the library sorting code in the series
"lib/sort: Optimize the number of swaps and comparisons".
- Alexey Gladkov has added the ability for code running within an IPC
namespace to alter its IPC and MQ limits. The series is "Allow to
change ipc/mq sysctls inside ipc namespace".
- Geert Uytterhoeven has contributed some dhrystone maintenance work in
the series "lib: dhry: miscellaneous cleanups".
- Ryusuke Konishi continues nilfs2 maintenance work in the series
"nilfs2: eliminate kmap and kmap_atomic calls"
"nilfs2: fix kernel bug at submit_bh_wbc()"
- Nathan Chancellor has updated our build tools requirements in the
series "Bump the minimum supported version of LLVM to 13.0.1".
- Muhammad Usama Anjum continues with the selftests maintenance work in
the series "selftests/mm: Improve run_vmtests.sh".
- Oleg Nesterov has done some maintenance work against the signal code
in the series "get_signal: minor cleanups and fix".
Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.
* tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
nilfs2: prevent kernel bug at submit_bh_wbc()
nilfs2: fix failure to detect DAT corruption in btree and direct mappings
ocfs2: enable ocfs2_listxattr for special files
ocfs2: remove SLAB_MEM_SPREAD flag usage
assoc_array: fix the return value in assoc_array_insert_mid_shortcut()
buildid: use kmap_local_page()
watchdog/core: remove sysctl handlers from public header
nilfs2: use div64_ul() instead of do_div()
mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
kexec: copy only happens before uchunk goes to zero
get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task
get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig
get_signal: don't abuse ksig->info.si_signo and ksig->sig
const_structs.checkpatch: add device_type
Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace()
list: leverage list_is_head() for list_entry_is_head()
nilfs2: MAINTAINERS: drop unreachable project mirror site
smp: make __smp_processor_id() 0-argument macro
fat: fix uninitialized field in nostale filehandles
...
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is hotplugged
as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving policy
wherein we allocate memory across nodes in a weighted fashion rather
than uniformly. This is beneficial in heterogeneous memory environments
appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the process
has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown situations.
The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's series
"Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page faults.
He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction test",
Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
our handling of DAX on archiecctures which have virtually aliasing data
caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
improvements in worst-case mmap_lock hold times during certain
userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability improvements
in his series "Mitigate a vmap lock contention". It realizes a 12x
improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging of
large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages() to
an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which are
configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
=TG55
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series
"mm: zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is
hotplugged as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving
policy wherein we allocate memory across nodes in a weighted fashion
rather than uniformly. This is beneficial in heterogeneous memory
environments appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the
process has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown
situations. The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings"
Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's
series "Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page
faults. He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction
test", Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and
refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
in our handling of DAX on archiecctures which have virtually aliasing
data caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides
dramatic improvements in worst-case mmap_lock hold times during
certain userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability
improvements in his series "Mitigate a vmap lock contention". It
realizes a 12x improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging
of large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages()
to an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which
are configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM
Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
mm/zswap: remove the memcpy if acomp is not sleepable
crypto: introduce: acomp_is_async to expose if comp drivers might sleep
memtest: use {READ,WRITE}_ONCE in memory scanning
mm: prohibit the last subpage from reusing the entire large folio
mm: recover pud_leaf() definitions in nopmd case
selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
selftests/mm: skip uffd hugetlb tests with insufficient hugepages
selftests/mm: dont fail testsuite due to a lack of hugepages
mm/huge_memory: skip invalid debugfs new_order input for folio split
mm/huge_memory: check new folio order when split a folio
mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
mm: fix list corruption in put_pages_list
mm: remove folio from deferred split list before uncharging it
filemap: avoid unnecessary major faults in filemap_fault()
mm,page_owner: drop unnecessary check
mm,page_owner: check for null stack_record before bumping its refcount
mm: swap: fix race between free_swap_and_cache() and swapoff()
mm/treewide: align up pXd_leaf() retval across archs
mm/treewide: drop pXd_large()
...
Introduce Kconfig options to set the kernel unaligned access support.
These options provide a non-portable alternative to the runtime
unaligned access probe.
To support this, the unaligned access probing code is moved into it's
own file and gated behind a new RISCV_PROBE_UNALIGNED_ACCESS_SUPPORT
option.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Tested-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20240308-disable_misaligned_probe_config-v9-4-a388770ba0ce@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>