mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-01-07 13:43:51 +00:00
4a2ef47546
9385 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Alexander Gordeev
|
94fd522069 |
s390/mm: rework arch_get_mappable_range() callback
As per description in mm/memory_hotplug.c platforms should define arch_get_mappable_range() that provides maximum possible addressable physical memory range for which the linear mapping could be created. The current implementation uses VMEM_MAX_PHYS macro as the maximum mappable physical address and it is simply a cast to vmemmap. Since the address is in physical address space the natural upper limit of MAX_PHYSMEM_BITS is honoured: vmemmap_start = min(vmemmap_start, 1UL << MAX_PHYSMEM_BITS); Further, to make sure the identity mapping would not overlay with vmemmap, the size of identity mapping could be stripped like this: ident_map_size = min(ident_map_size, vmemmap_start); Similarily, any other memory that could be added (e.g DCSS segment) should not overlay with vmemmap as well and that is prevented by using vmemmap (VMEM_MAX_PHYS macro) as the upper limit. However, while the use of VMEM_MAX_PHYS brings the desired result it actually poses two issues: 1. As described, vmemmap is handled as a physical address, although it is actually a pointer to struct page in virtual address space. 2. As vmemmap is a virtual address it could have been located anywhere in the virtual address space. However, the desired necessity to honour MAX_PHYSMEM_BITS limit prevents that. Rework arch_get_mappable_range() callback in a way it does not use VMEM_MAX_PHYS macro and does not confuse the notion of virtual vs physical address spacees as result. That paves the way for moving vmemmap elsewhere and optimizing the virtual address space layout. Introduce max_mappable preserved boot variable and let function setup_kernel_memory_layout() set it up. As result, the rest of the code is does not need to know the virtual memory layout specifics. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Alexander Gordeev
|
b9b4568843 |
s390/kexec: make machine_kexec() depend on CONFIG_KEXEC_CORE
Make machine_kexec.o and relocate_kernel.o depend on CONFIG_KEXEC_CORE option as other architectures do. Still generate machine_kexec_reloc.o unconditionally, since arch_kexec_do_relocs() function is neded by the decompressor. Suggested-by: Nathan Chancellor <nathan@kernel.org> Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Sven Schnelle
|
1256e70a08 |
s390/ftrace: enable HAVE_FUNCTION_GRAPH_RETVAL
Add support for tracing return values in the function graph tracer. This requires return_to_handler() to record gpr2 and the frame pointer Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
3325b4d857 |
s390/hypfs: factor out filesystem code
The s390_hypfs filesystem is deprecated and shouldn't be used due to its rather odd semantics. It creates a whole directory structure with static file contents so a user can read a consistent state while within that directory. Writing to its update attribute will remove and rebuild nearly the whole filesystem, so that again a user can read a consistent state, even if multiple files need to be read. Given that this wastes a lot of CPU cycles, and involves a lot of code, binary interfaces have been added quite a couple of years ago, which simply pass the binary data to user space, and let user space decode the data. This is the preferred and only way how the data should be retrieved. The assumption is that there are no users of the s390_hypfs filesystem. However instead of just removing the code, and having to revert in case there are actually users, factor the filesystem code out and make it only available via a new config option. This config option is supposed to be disabled. If it turns out there are no complaints the filesystem code can be removed probably in a couple of years. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
b7857acc1b |
s390/hypfs: remove open-coded PTR_ALIGN()
Get rid of page_align_ptr() and use PTR_ALIGN() instead. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
83f9567194 |
s390/hypfs: simplify memory allocation
Simplify memory allocation for diagnose 204 memory buffer: - allocate with __vmalloc_node() to enure page alignment - allocate real / physical memory area also within vmalloc area and handle vmalloc to real / physical address translation within diag204(). Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
86e74965bb |
s390/sthyi: enforce 4k alignment of vmalloc'ed area
vmalloc() does not guarantee any alignment, unless it is explicitly requested with e.g. __vmalloc_node(). Using diag204() with subcode 7 requires a 4k aligned virtual buffer. Therefore switch to __vmalloc_node(). Note: with the current vmalloc() implementation callers would still get a 4k aligned area, even though this is quite non-obvious looking at the code. So changing this in sthyi doesn't fix a real bug. It is just to make sure the code will not suffer from some obscure options, like it happened in the past with kmalloc() where debug options changed the assumed alignment of allocated memory areas. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
c83cd4fe31 |
s390/diag: handle diag 204 subcode 4 address correctly
Diagnose 204 subcode 4 requires a real (physical) address, but a virtual address is passed to the inline assembly. Convert the address to a physical address for only this specific case. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Anastasia Eskova
|
8cf57d7217 |
s390: add support for user-defined certificates
Enable receiving the user-defined certificates from the s390x hypervisor via new diagnose 0x320 calls, and make them available to the Linux root user as 'cert_store_key' type keys in a so-called 'cert_store' keyring. New user-space interfaces: /sys/firmware/cert_store/refresh Writing to this attribute re-fetches certificates via DIAG 0x320 /sys/firmware/cert_store/cs_status Reading from this attribute returns either of: "uninitialized" If no certificate has been retrieved yet "ok" If certificates have been successfully retrieved "failed (<number>)" If certificate retrieval failed with reason code <number> New debug trace areas: /sys/kernel/debug/s390dbf/cert_store_msg /sys/kernel/debug/s390dbf/cert_store_hexdump Usage example: To initiate request for certificates available to the system as root: $ echo 1 > /sys/firmware/cert_store/refresh Upon success the '/sys/firmware/cert_store/cs_status' contains the value 'ok'. $ cat /sys/firmware/cert_store/cs_status ok Get the ID of the keyring 'cert_store': $ keyctl search @us keyring cert_store OR $ keyctl link @us @s; keyctl request keyring cert_store Obtain list of IDs of certificates: $ keyctl rlist <cert_store keyring ID> Display certificate content as hex-dump: $ keyctl read <certificate ID> Read certificate contents as binary data: $ keyctl pipe <certificate ID> >cert_data Display certificate description: $ keyctl describe <certificate ID> The certificate description has the following format: <64 bytes certificate name in EBCDIC> ':' <certificate index as obtained from hypervisor> ':' <certificate store token obtained from hypervisor> The certificate description in /proc/keys has certificate name represented in ASCII. Users can read but cannot update the content of the certificate. Signed-off-by: Anastasia Eskova <anastasia.eskova@ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Linus Torvalds
|
269f4a4b85 |
ARM:
* Avoid pKVM finalization if KVM initialization fails * Add missing BTI instructions in the hypervisor, fixing an early boot failure on BTI systems * Handle MMU notifiers correctly for non hugepage-aligned memslots * Work around a bug in the architecture where hypervisor timer controls have UNKNOWN behavior under nested virt. * Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG in cpu hotplug resulting from per-CPU accessor sanity checking. * Make WFI emulation on GICv4 systems robust w.r.t. preemption, consistently requesting a doorbell interrupt on vcpu_put() * Uphold RES0 sysreg behavior when emulating older PMU versions * Avoid macro expansion when initializing PMU register names, ensuring the tracepoints pretty-print the sysreg. s390: * Two fixes for asynchronous destroy x86 fixes will come early next week. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmS9WpwUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOhTAf9EsrnrDK2U0Q1wIGZCh/3d662yslF Kh0GidZ62w4P1O4q19lFhJ5ixVdHJjGaNrYGZm77yAi0UaYzx4wvkohdaDhIdeMg 3do2uo6/iGU5m24BaVIXlSr8V6KDsMw0UvCAjxFWNvCzpR/7tpLOteXFS9rZQ+1N jfvoVKqE6LfgJ5IZiVdhIdEOxCf/QuQD/WdZ7fib8ngkY3dETi03MkATFKchtIzx j5aWruVHQlmb5ukZzHmmNuF7Yf6c1Bs+Rt6JFjyL+DxbtPBJmHP4TepYCDS4UqIm kkxrsqiTde13jQN7vDWzfzdpLQPIGV9OnvGWQoR4dyKfDlqSxJJyhXPuBw== =Mkzl -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "ARM: - Avoid pKVM finalization if KVM initialization fails - Add missing BTI instructions in the hypervisor, fixing an early boot failure on BTI systems - Handle MMU notifiers correctly for non hugepage-aligned memslots - Work around a bug in the architecture where hypervisor timer controls have UNKNOWN behavior under nested virt - Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG in cpu hotplug resulting from per-CPU accessor sanity checking - Make WFI emulation on GICv4 systems robust w.r.t. preemption, consistently requesting a doorbell interrupt on vcpu_put() - Uphold RES0 sysreg behavior when emulating older PMU versions - Avoid macro expansion when initializing PMU register names, ensuring the tracepoints pretty-print the sysreg s390: - Two fixes for asynchronous destroy x86 fixes will come early next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: s390: pv: fix index value of replaced ASCE KVM: s390: pv: simplify shutdown and fix race KVM: arm64: Fix the name of sys_reg_desc related to PMU KVM: arm64: Correctly handle RES0 bits PMEVTYPER<n>_EL0.evtCount KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption KVM: arm64: Add missing BTI instructions KVM: arm64: Correctly handle page aging notifiers for unaligned memslot KVM: arm64: Disable preemption in kvm_arch_hardware_enable() KVM: arm64: Handle kvm_arm_init failure correctly in finalize_pkvm KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits |
||
Wang Ming
|
1f7e906775 |
s390/crypto: use kfree_sensitive() instead of kfree()
key might contain private part of the key, so better use kfree_sensitive() to free it. Signed-off-by: Wang Ming <machel@vivo.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Link: https://lore.kernel.org/r/20230717094533.18418-1-machel@vivo.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Sven Schnelle
|
7686762d1e |
s390/mm: fix per vma lock fault handling
With per-vma locks, handle_mm_fault() may return non-fatal error
flags. In this case the code should reset the fault flags before
returning.
Fixes:
|
||
Claudio Imbrenda
|
c2fceb59bb |
KVM: s390: pv: fix index value of replaced ASCE
The index field of the struct page corresponding to a guest ASCE should
be 0. When replacing the ASCE in s390_replace_asce(), the index of the
new ASCE should also be set to 0.
Having the wrong index might lead to the wrong addresses being passed
around when notifying pte invalidations, and eventually to validity
intercepts (VM crash) if the prefix gets unmapped and the notifier gets
called with the wrong address.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Fixes:
|
||
Claudio Imbrenda
|
5ff9218157 |
KVM: s390: pv: simplify shutdown and fix race
Simplify the shutdown of non-protected VMs. There is no need to do
complex manipulations of the counter if it was zero.
This also fixes a very rare race which caused pages to be torn down
from the address space with a non-zero counter even on older machines
that don't support the UVC instruction, causing a crash.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes:
|
||
Jeff Layton
|
95f8020459 |
s390: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230705190309.579783-14-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
Rick Edgecombe
|
6ecc21bb43 |
mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Earlier work did the first step, so next move the callers that don't have a VMA to pte_mkwrite_novma(). Also do the same for pmd_mkwrite(). This will be ok for the shadow stack feature, as these callers are on kernel memory which will not need to be made shadow stack, and the other architectures only currently support one type of memory in pte_mkwrite() Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20230613001108.3040476-3-rick.p.edgecombe%40intel.com |
||
Rick Edgecombe
|
2f0584f3f4 |
mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). The goal is to make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Start the process by renaming pte_mkwrite() to pte_mkwrite_novma() and adding the pte_mkwrite() wrapper in linux/pgtable.h. Apply the same pattern for pmd_mkwrite(). Since not all archs have a pmd_mkwrite_novma(), create a new arch config HAS_HUGE_PAGE that can be used to tell if pmd_mkwrite() should be defined. Otherwise in the !HAS_HUGE_PAGE cases the compiler would not be able to find pmd_mkwrite_novma(). No functional change. Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/lkml/CAHk-=wiZjSu7c9sFYZb3q04108stgHff2wfbokGCCgW7riz+8Q@mail.gmail.com/ Link: https://lore.kernel.org/all/20230613001108.3040476-2-rick.p.edgecombe%40intel.com |
||
Linus Torvalds
|
a452483508 |
s390 updates for 6.5 merge window part 2
- Fix virtual vs physical address confusion in vmem_add_range() and vmem_remove_range() functions. - Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h> throughout s390 code. - Make all PSW related defines also available for assembler files. Remove PSW_DEFAULT_KEY define from uapi for that. - When adding an undefined symbol the build still succeeds, but userspace crashes trying to execute VDSO, because the symbol is not resolved. Add undefined symbols check to prevent that. - Use kvmalloc_array() instead of kzalloc() for allocaton of 256k memory when executing s390 crypto adapter IOCTL. - Add -fPIE flag to prevent decompressor misaligned symbol build error with clang. - Use .balign instead of .align everywhere. This is a no-op for s390, but with this there no mix in using .align and .balign anymore. - Filter out -mno-pic-data-is-text-relative flag when compiling kernel to prevent VDSO build error. - Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS mask directly. - Do not retry administrative requests to some s390 crypto cards, since the firmware assumes replay attacks. - Remove most of the debug code, which is build in when kernel config option CONFIG_ZCRYPT_DEBUG is enabled. - Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch off the multiple devices support for the s390 zcrypt device driver. - With the conversion to generic entry machine checks are accounted to the current context instead of irq time. As result, the STCKF instruction at the beginning of the machine check handler and the lowcore member are no longer required, therefore remove it. - Fix various typos found with codespell. - Minor cleanups to CPU-measurement Counter and Sampling Facilities code. - Revert patch that removes VMEM_MAX_PHYS macro, since it causes a regression. -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZKakSBccYWdvcmRlZXZA bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8HeIAQCg9RX3/olsZhCqRNLZ/O+6FXAF 29ohi2JmVqxJBKkmwgEA/QXCjoTOp41pQJ1FD39HnI8DeYpJFRnYYE5D3acibAw= =2Ykk -----END PGP SIGNATURE----- Merge tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Alexander Gordeev: - Fix virtual vs physical address confusion in vmem_add_range() and vmem_remove_range() functions - Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h> throughout s390 code - Make all PSW related defines also available for assembler files. Remove PSW_DEFAULT_KEY define from uapi for that - When adding an undefined symbol the build still succeeds, but userspace crashes trying to execute VDSO, because the symbol is not resolved. Add undefined symbols check to prevent that - Use kvmalloc_array() instead of kzalloc() for allocaton of 256k memory when executing s390 crypto adapter IOCTL - Add -fPIE flag to prevent decompressor misaligned symbol build error with clang - Use .balign instead of .align everywhere. This is a no-op for s390, but with this there no mix in using .align and .balign anymore - Filter out -mno-pic-data-is-text-relative flag when compiling kernel to prevent VDSO build error - Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS mask directly - Do not retry administrative requests to some s390 crypto cards, since the firmware assumes replay attacks - Remove most of the debug code, which is build in when kernel config option CONFIG_ZCRYPT_DEBUG is enabled - Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch off the multiple devices support for the s390 zcrypt device driver - With the conversion to generic entry machine checks are accounted to the current context instead of irq time. As result, the STCKF instruction at the beginning of the machine check handler and the lowcore member are no longer required, therefore remove it - Fix various typos found with codespell - Minor cleanups to CPU-measurement Counter and Sampling Facilities code - Revert patch that removes VMEM_MAX_PHYS macro, since it causes a regression * tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (25 commits) Revert "s390/mm: get rid of VMEM_MAX_PHYS macro" s390/cpum_sf: remove check on CPU being online s390/cpum_sf: handle casts consistently s390/cpum_sf: remove unnecessary debug statement s390/cpum_sf: remove parameter in call to pr_err s390/cpum_sf: simplify function setup_pmu_cpu s390/cpum_cf: remove unneeded debug statements s390/entry: remove mcck clock s390: fix various typos s390/zcrypt: remove ZCRYPT_MULTIDEVNODES kernel config option s390/zcrypt: do not retry administrative requests s390/zcrypt: cleanup some debug code s390/entry: rework entering DAT-on mode on CPU restart s390/mm: fence off VM macros from asm and linker s390: include linux/io.h instead of asm/io.h s390/ptrace: make all psw related defines also available for asm s390/ptrace: remove PSW_DEFAULT_KEY from uapi s390/vdso: filter out mno-pic-data-is-text-relative cflag s390: consistently use .balign instead of .align s390/decompressor: fix misaligned symbol build error ... |
||
Alexander Gordeev
|
54372cf043 |
Revert "s390/mm: get rid of VMEM_MAX_PHYS macro"
This reverts commit
|
||
Thomas Richter
|
6aca56c024 |
s390/cpum_sf: remove check on CPU being online
During sampling event initialization, a check is done if that particular CPU the event is to be installed on is actually online. This check is not necessary, as it is also performed in the system call entry point. Therefore remove this check. No functional change. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
b2534c28b2 |
s390/cpum_sf: handle casts consistently
The casts are written in two different notations: (cast) expression and (cast)expression Convert statements with the first notation to the second notation. No functional change. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
c13166bdb2 |
s390/cpum_sf: remove unnecessary debug statement
Remove debug_sprint_event() statement right after an pr_err() statement. No additional debug information is generated. No functional change. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
b2ae496949 |
s390/cpum_sf: remove parameter in call to pr_err
The op argument is hardcoded in the parameter list of function pr_err. Make the op code part of the text printed by pr_err. No functional change. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
eeeff534e9 |
s390/cpum_sf: simplify function setup_pmu_cpu
Print the error message when the FAILURE flag is set. This saves on pr_err statement as the text of the error message is identical in both failures. Also observe reverse Xmas tree variable declarations in this function. No functional change. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
f4767f9f32 |
s390/cpum_cf: remove unneeded debug statements
Remove most debug statements which are not needed anymore from the CPU Measurement counter facility device driver. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Linus Torvalds
|
e8069f5a8e |
ARM64:
* Eager page splitting optimization for dirty logging, optionally allowing for a VM to avoid the cost of hugepage splitting in the stage-2 fault path. * Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with services that live in the Secure world. pKVM intervenes on FF-A calls to guarantee the host doesn't misuse memory donated to the hyp or a pKVM guest. * Support for running the split hypervisor with VHE enabled, known as 'hVHE' mode. This is extremely useful for testing the split hypervisor on VHE-only systems, and paves the way for new use cases that depend on having two TTBRs available at EL2. * Generalized framework for configurable ID registers from userspace. KVM/arm64 currently prevents arbitrary CPU feature set configuration from userspace, but the intent is to relax this limitation and allow userspace to select a feature set consistent with the CPU. * Enable the use of Branch Target Identification (FEAT_BTI) in the hypervisor. * Use a separate set of pointer authentication keys for the hypervisor when running in protected mode, as the host is untrusted at runtime. * Ensure timer IRQs are consistently released in the init failure paths. * Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps (FEAT_EVT), as it is a register commonly read from userspace. * Erratum workaround for the upcoming AmpereOne part, which has broken hardware A/D state management. RISC-V: * Redirect AMO load/store misaligned traps to KVM guest * Trap-n-emulate AIA in-kernel irqchip for KVM guest * Svnapot support for KVM Guest s390: * New uvdevice secret API * CMM selftest and fixes * fix racy access to target CPU for diag 9c x86: * Fix missing/incorrect #GP checks on ENCLS * Use standard mmu_notifier hooks for handling APIC access page * Drop now unnecessary TR/TSS load after VM-Exit on AMD * Print more descriptive information about the status of SEV and SEV-ES during module load * Add a test for splitting and reconstituting hugepages during and after dirty logging * Add support for CPU pinning in demand paging test * Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way * Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage recovery threads (because nx_huge_pages=off can be toggled at runtime) * Move handling of PAT out of MTRR code and dedup SVM+VMX code * Fix output of PIC poll command emulation when there's an interrupt * Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. * Misc cleanups, fixes and comments Generic: * Miscellaneous bugfixes and cleanups Selftests: * Generate dependency files so that partial rebuilds work as expected -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmSgHrIUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroORcAf+KkBlXwQMf+Q0Hy6Mfe0OtkKmh0Ae 6HJ6dsuMfOHhWv5kgukh+qvuGUGzHq+gpVKmZg2yP3h3cLHOLUAYMCDm+rjXyjsk F4DbnJLfxq43Pe9PHRKFxxSecRcRYCNox0GD5UYL4PLKcH0FyfQrV+HVBK+GI8L3 FDzUcyJkR12Lcj1qf++7fsbzfOshL0AJPmidQCoc6wkLJpUEr/nYUqlI1Kx3YNuQ LKmxFHS4l4/O/px3GKNDrLWDbrVlwciGIa3GZLS52PZdW3mAqT+cqcPcYK6SW71P m1vE80VbNELX5q3YSRoOXtedoZ3Pk97LEmz/xQAsJ/jri0Z5Syk0Ok0m/Q== =AMXp -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM64: - Eager page splitting optimization for dirty logging, optionally allowing for a VM to avoid the cost of hugepage splitting in the stage-2 fault path. - Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with services that live in the Secure world. pKVM intervenes on FF-A calls to guarantee the host doesn't misuse memory donated to the hyp or a pKVM guest. - Support for running the split hypervisor with VHE enabled, known as 'hVHE' mode. This is extremely useful for testing the split hypervisor on VHE-only systems, and paves the way for new use cases that depend on having two TTBRs available at EL2. - Generalized framework for configurable ID registers from userspace. KVM/arm64 currently prevents arbitrary CPU feature set configuration from userspace, but the intent is to relax this limitation and allow userspace to select a feature set consistent with the CPU. - Enable the use of Branch Target Identification (FEAT_BTI) in the hypervisor. - Use a separate set of pointer authentication keys for the hypervisor when running in protected mode, as the host is untrusted at runtime. - Ensure timer IRQs are consistently released in the init failure paths. - Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps (FEAT_EVT), as it is a register commonly read from userspace. - Erratum workaround for the upcoming AmpereOne part, which has broken hardware A/D state management. RISC-V: - Redirect AMO load/store misaligned traps to KVM guest - Trap-n-emulate AIA in-kernel irqchip for KVM guest - Svnapot support for KVM Guest s390: - New uvdevice secret API - CMM selftest and fixes - fix racy access to target CPU for diag 9c x86: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Drop now unnecessary TR/TSS load after VM-Exit on AMD - Print more descriptive information about the status of SEV and SEV-ES during module load - Add a test for splitting and reconstituting hugepages during and after dirty logging - Add support for CPU pinning in demand paging test - Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way - Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage recovery threads (because nx_huge_pages=off can be toggled at runtime) - Move handling of PAT out of MTRR code and dedup SVM+VMX code - Fix output of PIC poll command emulation when there's an interrupt - Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. - Misc cleanups, fixes and comments Generic: - Miscellaneous bugfixes and cleanups Selftests: - Generate dependency files so that partial rebuilds work as expected" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits) Documentation/process: Add a maintainer handbook for KVM x86 Documentation/process: Add a label for the tip tree handbook's coding style KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index RISC-V: KVM: Remove unneeded semicolon RISC-V: KVM: Allow Svnapot extension for Guest/VM riscv: kvm: define vcpu_sbi_ext_pmu in header RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel emulation of AIA APLIC RISC-V: KVM: Implement device interface for AIA irqchip RISC-V: KVM: Skeletal in-kernel AIA irqchip support RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero RISC-V: KVM: Add APLIC related defines RISC-V: KVM: Add IMSIC related defines RISC-V: KVM: Implement guest external interrupt line management KVM: x86: Remove PRIx* definitions as they are solely for user space s390/uv: Update query for secret-UVCs s390/uv: replace scnprintf with sysfs_emit s390/uvdevice: Add 'Lock Secret Store' UVC ... |
||
Sven Schnelle
|
efccd4e0f3 |
s390/entry: remove mcck clock
In the past machine checks where accounted as irq time. With the conversion to generic entry, it was decided to account machine checks to the current context. The stckf at the beginning of the machine check handler and the lowcore member is no longer required, therefore remove it. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
cada938a01 |
s390: fix various typos
Fix various typos found with codespell. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
edbe289898 |
s390/entry: rework entering DAT-on mode on CPU restart
Instead of enforcing PSW_MASK_DAT bit on previously stored in lowcore restart_psw.mask use the PSW_KERNEL_BITS mask (which contains PSW_MASK_DAT) directly. As result, the PSW mask stored in lowcore is only used to enter the CPU restart routine, while PSW_KERNEL_BITS is used to enter the kernel code - similarily to commit 64ea2977add2 ("s390/mm: start kernel with DAT enabled"). Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
b492425c70 |
s390/mm: fence off VM macros from asm and linker
Prevent assembler and linker scripts compilation errors by fencing it off with __ASSEMBLY__ define. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
b378a98261 |
s390: include linux/io.h instead of asm/io.h
Include linux/io.h instead of asm/io.h everywhere. linux/io.h includes asm/io.h, so this shouldn't cause any problems. Instead this might help for some randconfig build errors which were reported due to some undefined io related functions. Also move the changed include so it stays grouped together with other includes from the same directory. For ctcm_mpc.c also remove not needed comments (actually questions). Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
b8af599977 |
s390/ptrace: make all psw related defines also available for asm
Use the _AC() macro to make all psw related defines also available for assembler files. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
6376402841 |
s390/ptrace: remove PSW_DEFAULT_KEY from uapi
Move PSW_DEFAULT_KEY from uapi/asm/ptrace.h to asm/ptrace.h. This is possible, since it depends on PAGE_DEFAULT_ACC which is not part of uapi. Or in other words: this define cannot be used without error. Therefore remove it from uapi. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Paolo Bonzini
|
a443e2609c |
* New uvdevice secret API
* New CMM selftest * cmm fix * diag 9c racy access of target cpu fix -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmSQP7EACgkQ41TmuOI4 ufi0MA//UUtbzfmKswIWFdSBKLbcLUJvLL9Ipov4HSlJCm5FSE2QsQsF4Ht0Ju6D TBNDoTCjLkCVXkC1t6DB8WNJbLnV7U2mCW5V23Mf0d16/pq47Ab/NtgZbaz5ouxd X45kYQTZ8CSl0+grsOyFj5beNlzz/SVXN2t8kIuRJdRc2ZdoE5fQqV5jKgC+8fes Opjk2lph4wBojoBNistwe4kmalU3obXNyWagXXitYYpwdYrfLC55Jxq86fJtiP06 ZhB44NcoAXpcgVNgaRCklcQpWrpTIERqcPNaLYL04zLprP+a+uBhDKyHVmgt0H2B 3WQm5bYP/kfUmgO4mwtWylDhzAhbDqCzVek9ozKh5VotsW8Ezz8jbATlazydA9iA K7ZY5++hmo48fyRn9lPfxf8Mn6cI/8vBtPlkcYtMkl+JOs40cuOfnIF1GDIgcH9X Nr2DW5QSXb+3Z4bXphIaVA1sQqq+8bejmoBPVusbUdwtfJBHTJQfHRMifrvr5MMT T9NvbyliHNTgIh95CMxNRdNinmsKRYEldvIP7Nvqj4mLMWwKEdLwD46eDIU6cuYc qUrIiIjWiyEVx2TSH0e3EVa4VgZxXui8c1tbwapQDVdWNihKjRk8TTb98BbURxtr 0phvjsInz+te1Ec5UrfviKh4aj4oK9Ce33NcOcgrZjsY82OcelQ= =0wyC -----END PGP SIGNATURE----- Merge tag 'kvm-s390-next-6.5-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD * New uvdevice secret API * New CMM selftest * cmm fix * diag 9c racy access of target cpu fix |
||
Linus Torvalds
|
9471f1f2f5 |
Merge branch 'expand-stack'
This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout. It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking. And it worked fine. We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward. That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken. Oops. It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful. We have basically three different cases of stack expansion, and they all work just a bit differently: - the common and obvious case is the page fault handling. It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike. - the simplest case is the execve() code that creates a new stack. There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong. - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace(). And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case. None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times. And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store. So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it. Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages. And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds. That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern. Along with the couple of special cases in execve() and GUP. So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's a obvious path forward in the conversion. The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stackls are special, because at execve time even they grow down". The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead od having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP. And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case). In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace(). Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases. Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else. Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around. Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal. Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch. That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates. Reported-by: Ruihan Li <lrh2000@pku.edu.cn> * branch 'expand-stack': gup: add warning if some caller would seem to want stack expansion mm: always expand the stack with the mmap write lock held execve: expand new process stack manually ahead of time mm: make find_extend_vma() fail if write lock not held powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma() mm/fault: convert remaining simple cases to lock_mm_and_find_vma() arm/mm: Convert to using lock_mm_and_find_vma() riscv/mm: Convert to using lock_mm_and_find_vma() mips/mm: Convert to using lock_mm_and_find_vma() powerpc/mm: Convert to using lock_mm_and_find_vma() arm64/mm: Convert to using lock_mm_and_find_vma() mm: make the page fault mmap locking killable mm: introduce new 'lock_mm_and_find_vma()' page fault helper |
||
Linus Torvalds
|
77b1a7f7a0 |
- Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in
top-level directories. - Douglas Anderson has added a new "buddy" mode to the hardlockup detector. It permits the detector to work on architectures which cannot provide the required interrupts, by having CPUs periodically perform checks on other CPUs. - Zhen Lei has enhanced kexec's ability to support two crash regions. - Petr Mladek has done a lot of cleanup on the hard lockup detector's Kconfig entries. - And the usual bunch of singleton patches in various places. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJelTAAKCRDdBJ7gKXxA juDkAP0VXWynzkXoojdS/8e/hhi+htedmQ3v2dLZD+vBrctLhAEA7rcH58zAVoWa 2ejqO6wDrRGUC7JQcO9VEjT0nv73UwU= =F293 -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level directories - Douglas Anderson has added a new "buddy" mode to the hardlockup detector. It permits the detector to work on architectures which cannot provide the required interrupts, by having CPUs periodically perform checks on other CPUs - Zhen Lei has enhanced kexec's ability to support two crash regions - Petr Mladek has done a lot of cleanup on the hard lockup detector's Kconfig entries - And the usual bunch of singleton patches in various places * tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) kernel/time/posix-stubs.c: remove duplicated include ocfs2: remove redundant assignment to variable bit_off watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h devres: show which resource was invalid in __devm_ioremap_resource() watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64 watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h watchdog/hardlockup: make the config checks more straightforward watchdog/hardlockup: sort hardlockup detector related config values a logical way watchdog/hardlockup: move SMP barriers from common code to buddy code watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY watchdog/buddy: don't copy the cpumask in watchdog_next_cpu() watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog() watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy() watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick() watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe() watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails ... |
||
Linus Torvalds
|
6e17c6de3d |
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing. - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability. - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning. - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface. - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree. - Johannes Weiner has done some cleanup work on the compaction code. - David Hildenbrand has contributed additional selftests for get_user_pages(). - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code. - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code. - Huang Ying has done some maintenance on the swap code's usage of device refcounting. - Christoph Hellwig has some cleanups for the filemap/directio code. - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses. - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings. - John Hubbard has a series of fixes to the MM selftesting code. - ZhangPeng continues the folio conversion campaign. - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock. - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8. - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management. - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code. - Vishal Moola also has done some folio conversion work. - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh J0qXCzqaaN8+BuEyLGDVPaXur9KirwY= =B7yQ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ... |
||
Sumanth Korikkar
|
d15e4314ab |
s390/vdso: filter out mno-pic-data-is-text-relative cflag
cmd_vdso_check checks if there are any dynamic relocations in vdso64.so.dbg. When kernel is compiled with -mno-pic-data-is-text-relative, R_390_RELATIVE relocs are generated and this results in kernel build error. kpatch uses -mno-pic-data-is-text-relative option when building the kernel to prevent relative addressing between code and data. The flag avoids relocation error when klp text and data are too far apart kpatch does not patch vdso code and hence the mno-pic-data-is-text-relative flag is not essential. Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
27d45655fa |
s390: consistently use .balign instead of .align
The .align directive has inconsistent behavior across architectures. Use .balign instead everywhere. This is a no-op for s390, but with this there is no mix in using .align and .balign anymore. Future code is supposed to use only .balign. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
938f0c35d7 |
s390/decompressor: fix misaligned symbol build error
Nathan Chancellor reported a kernel build error on Fedora 39:
$ clang --version | head -1
clang version 16.0.5 (Fedora 16.0.5-1.fc39)
$ s390x-linux-gnu-ld --version | head -1
GNU ld version 2.40-1.fc39
$ make -skj"$(nproc)" ARCH=s390 CC=clang CROSS_COMPILE=s390x-linux-gnu- olddefconfig all
s390x-linux-gnu-ld: arch/s390/boot/startup.o(.text+0x5b4): misaligned symbol `_decompressor_end' (0x35b0f) for relocation R_390_PC32DBL
make[3]: *** [.../arch/s390/boot/Makefile:78: arch/s390/boot/vmlinux] Error 1
It turned out that the problem with misaligned symbols on s390 was fixed
with commit
|
||
Sven Schnelle
|
0dd0bbc200 |
s390/vdso: check for undefined symbols after build
When adding an undefined symbol the build still succeeds, but userspace is crashing trying to execute vdso because the undefined symbol is not resolved. Add the check for undefined symbols to prevent this. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Baoquan He
|
51f513fd96 |
s390/mm: do not include <asm-generic/io.h> directly
We should always include <asm/io.h> in ARCH, but not <asm-generic/io.h> directly. Otherwise, macro defined by ARCH won't be seen and could cause building error. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306100105.8GHnoMCP-lkp@intel.com/ Link: https://lore.kernel.org/all/ZIWrtFMUnRfVP5h0@MiWiFi-R3L-srv/ Signed-off-by: Baoquan He <bhe@redhat.com> [agordeev@linux.ibm.com changed patch description] Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
688fcbbb9c |
s390/vmem: fix virtual vs physical address confusion
Fix virtual vs physical address confusion (which currently are the same). Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
456be42aa7 |
s390/mm: get rid of VMEM_MAX_PHYS macro
VMEM_MAX_PHYS is supposed to be the highest physical address that can be added to the identity mapping. It should match ident_map_size, which has the same meaning. However, unlike ident_map_size it is not adjusted against various limiting factors (see the comment to setup_ident_map_size() function). That renders all checks against VMEM_MAX_PHYS invalid. Further, VMEM_MAX_PHYS is currently set to vmemmap, which is an address in virtual memory space. However, it gets compared against physical addresses in various locations. That works, because both address spaces are the same on s390, but otherwise it is wrong. Instead of fixing VMEM_MAX_PHYS misuse and semantics just remove it. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Linus Torvalds
|
582c161cf3 |
hardening updates for v6.5-rc1
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko) - Convert strreplace() to return string start (Andy Shevchenko) - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook) - Add missing function prototypes seen with W=1 (Arnd Bergmann) - Fix strscpy() kerndoc typo (Arne Welzel) - Replace strlcpy() with strscpy() across many subsystems which were either Acked by respective maintainers or were trivial changes that went ignored for multiple weeks (Azeem Shaikh) - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers) - Add KUnit tests for strcat()-family - Enable KUnit tests of FORTIFY wrappers under UML - Add more complete FORTIFY protections for strlcat() - Add missed disabling of FORTIFY for all arch purgatories. - Enable -fstrict-flex-arrays=3 globally - Tightening UBSAN_BOUNDS when using GCC - Improve checkpatch to check for strcpy, strncpy, and fake flex arrays - Improve use of const variables in FORTIFY - Add requested struct_size_t() helper for types not pointers - Add __counted_by macro for annotating flexible array size members -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx 9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo 2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA 9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8 sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r 5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J cwANDmkL6QQK7yfeeg== =s0j1 -----END PGP SIGNATURE----- Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "There are three areas of note: A bunch of strlcpy()->strscpy() conversions ended up living in my tree since they were either Acked by maintainers for me to carry, or got ignored for multiple weeks (and were trivial changes). The compiler option '-fstrict-flex-arrays=3' has been enabled globally, and has been in -next for the entire devel cycle. This changes compiler diagnostics (though mainly just -Warray-bounds which is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_ coverage. In other words, there are no new restrictions, just potentially new warnings. Any new FORTIFY warnings we've seen have been fixed (usually in their respective subsystem trees). For more details, see commit |
||
Linus Torvalds
|
6a46676994 |
s390 updates for 6.5 merge window
- Fix the style of protected key API driver source: use x-mas tree for all local variable declarations. - Rework protected key API driver to not use the struct pkey_protkey and pkey_clrkey anymore. Both structures have a fixed size buffer, but with the support of ECC protected key these buffers are not big enough. Use dynamic buffers internally and transparently for userspace. - Add support for a new 'non CCA clear key token' with ECC clear keys supported: ECC P256, ECC P384, ECC P521, ECC ED25519 and ECC ED448. This makes it possible to derive a protected key from the ECC clear key input via PKEY_KBLOB2PROTK3 ioctl, while currently the only way to derive is via PCKMO instruction. - The s390 PMU of PAI crypto and extension 1 NNPA counters use atomic_t for reference counting. Replace this with the proper data type refcount_t. - Select ARCH_SUPPORTS_INT128, but limit this to clang for now, since gcc generates inefficient code, which may lead to stack overflows. - Replace one-element array with flexible-array member in struct vfio_ccw_parent and refactor the rest of the code accordingly. Also, prefer struct_size() over sizeof() open- coded versions. - Introduce OS_INFO_FLAGS_ENTRY pointing to a flags field and OS_INFO_FLAG_REIPL_CLEAR flag that informs a dumper whether the system memory should be cleared or not once dumped. - Fix a hang when a user attempts to remove a VFIO-AP mediated device attached to a guest: add VFIO_DEVICE_GET_IRQ_INFO and VFIO_DEVICE_SET_IRQS IOCTLs and wire up the VFIO bus driver callback to request a release of the device. - Fix calculation for R_390_GOTENT relocations for modules. - Allow any user space process with CAP_PERFMON capability read and display the CPU Measurement facility counter sets. - Rework large statically-defined per-CPU cpu_cf_events data structure and replace it with dynamically allocated structures created when a perf_event_open() system call is invoked or /dev/hwctr device is accessed. -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZJsnCRccYWdvcmRlZXZA bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8D+RAQCN5CYdql/X8kOMcs4jJvDHEZf8 5CHYfmT2SLNs+zWtLgD/f/C9XXv3El/0wMBvuWSZ+T1nw+imt2cz+FJ+Nor+UQg= =R7Dm -----END PGP SIGNATURE----- Merge tag 's390-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Alexander Gordeev: - Fix the style of protected key API driver source: use x-mas tree for all local variable declarations - Rework protected key API driver to not use the struct pkey_protkey and pkey_clrkey anymore. Both structures have a fixed size buffer, but with the support of ECC protected key these buffers are not big enough. Use dynamic buffers internally and transparently for userspace - Add support for a new 'non CCA clear key token' with ECC clear keys supported: ECC P256, ECC P384, ECC P521, ECC ED25519 and ECC ED448. This makes it possible to derive a protected key from the ECC clear key input via PKEY_KBLOB2PROTK3 ioctl, while currently the only way to derive is via PCKMO instruction - The s390 PMU of PAI crypto and extension 1 NNPA counters use atomic_t for reference counting. Replace this with the proper data type refcount_t - Select ARCH_SUPPORTS_INT128, but limit this to clang for now, since gcc generates inefficient code, which may lead to stack overflows - Replace one-element array with flexible-array member in struct vfio_ccw_parent and refactor the rest of the code accordingly. Also, prefer struct_size() over sizeof() open- coded versions - Introduce OS_INFO_FLAGS_ENTRY pointing to a flags field and OS_INFO_FLAG_REIPL_CLEAR flag that informs a dumper whether the system memory should be cleared or not once dumped - Fix a hang when a user attempts to remove a VFIO-AP mediated device attached to a guest: add VFIO_DEVICE_GET_IRQ_INFO and VFIO_DEVICE_SET_IRQS IOCTLs and wire up the VFIO bus driver callback to request a release of the device - Fix calculation for R_390_GOTENT relocations for modules - Allow any user space process with CAP_PERFMON capability read and display the CPU Measurement facility counter sets - Rework large statically-defined per-CPU cpu_cf_events data structure and replace it with dynamically allocated structures created when a perf_event_open() system call is invoked or /dev/hwctr device is accessed * tag 's390-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/cpum_cf: rework PER_CPU_DEFINE of struct cpu_cf_events s390/cpum_cf: open access to hwctr device for CAP_PERFMON privileged process s390/module: fix rela calculation for R_390_GOTENT s390/vfio-ap: wire in the vfio_device_ops request callback s390/vfio-ap: realize the VFIO_DEVICE_SET_IRQS ioctl s390/vfio-ap: realize the VFIO_DEVICE_GET_IRQ_INFO ioctl s390/pkey: add support for ecc clear key s390/pkey: do not use struct pkey_protkey s390/pkey: introduce reverse x-mas trees s390/zcore: conditionally clear memory on reipl s390/ipl: add REIPL_CLEAR flag to os_info vfio/ccw: use struct_size() helper vfio/ccw: replace one-element array with flexible-array member s390: select ARCH_SUPPORTS_INT128 s390/pai_ext: replace atomic_t with refcount_t s390/pai_crypto: replace atomic_t with refcount_t |
||
Linus Torvalds
|
bc6cb4d5bc |
Locking changes for v6.5:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double(). The cmpxchg128() family of functions is basically & functionally the same as cmpxchg_double(), but with a saner interface: instead of a 6-parameter horror that forced u128 - u64/u64-halves layout details on the interface and exposed users to complexity, fragility & bugs, use a natural 3-parameter interface with u128 types. - Restructure the generated atomic headers, and add kerneldoc comments for all of the generic atomic{,64,_long}_t operations. Generated definitions are much cleaner now, and come with documentation. - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when taking multiple locks of the same type. This gets rid of one use of lockdep_set_novalidate_class() in the bcache code. - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable shadowing generating garbage code on Clang on certain ARM builds. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig 8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0 YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ 7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC x9Nt+Tp0Ze4= =DsYj -----END PGP SIGNATURE----- Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() The cmpxchg128() family of functions is basically & functionally the same as cmpxchg_double(), but with a saner interface. Instead of a 6-parameter horror that forced u128 - u64/u64-halves layout details on the interface and exposed users to complexity, fragility & bugs, use a natural 3-parameter interface with u128 types. - Restructure the generated atomic headers, and add kerneldoc comments for all of the generic atomic{,64,_long}_t operations. The generated definitions are much cleaner now, and come with documentation. - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when taking multiple locks of the same type. This gets rid of one use of lockdep_set_novalidate_class() in the bcache code. - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable shadowing generating garbage code on Clang on certain ARM builds. * tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg() locking/atomic: treewide: delete arch_atomic_*() kerneldoc locking/atomic: docs: Add atomic operations to the driver basic API documentation locking/atomic: scripts: generate kerneldoc comments docs: scripts: kernel-doc: accept bitwise negation like ~@var locking/atomic: scripts: simplify raw_atomic*() definitions locking/atomic: scripts: simplify raw_atomic_long*() definitions locking/atomic: scripts: split pfx/name/sfx/order locking/atomic: scripts: restructure fallback ifdeffery locking/atomic: scripts: build raw_atomic_long*() directly locking/atomic: treewide: use raw_atomic*_<op>() locking/atomic: scripts: add trivial raw_atomic*_<op>() locking/atomic: scripts: factor out order template generation locking/atomic: scripts: remove leftover "${mult}" locking/atomic: scripts: remove bogus order parameter locking/atomic: xtensa: add preprocessor symbols locking/atomic: x86: add preprocessor symbols locking/atomic: sparc: add preprocessor symbols locking/atomic: sh: add preprocessor symbols ... |
||
Linus Torvalds
|
ed3b7923a8 |
Scheduler changes for v6.5:
- Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. - Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. - Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations. - Fix task_struct::saved_state handling. - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible - Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. - Remove unused code - Mark more functions __init - Fix shadow-variable warnings Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo 9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr 4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N PiDbtX760ic= =EWQA -----END PGP SIGNATURE----- Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations - Fix task_struct::saved_state handling - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. - Remove unused code - Mark more functions __init - Fix shadow-variable warnings" * tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle() sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() sched/core: Fixed missing rq clock update before calling set_rq_offline() sched/deadline: Update GRUB description in the documentation sched/deadline: Fix bandwidth reclaim equation in GRUB sched/wait: Fix a kthread_park race with wait_woken() sched/topology: Mark set_sched_topology() __init sched/fair: Rename variable cpu_util eff_util arm64/arch_timer: Fix MMIO byteswap sched/fair, cpufreq: Introduce 'runnable boosting' sched/fair: Refactor CPU utilization functions cpuidle: Use local_clock_noinstr() sched/clock: Provide local_clock_noinstr() x86/tsc: Provide sched_clock_noinstr() clocksource: hyper-v: Provide noinstr sched_clock() clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX x86/vdso: Fix gettimeofday masking math64: Always inline u128 version of mul_u64_u64_shr() s390/time: Provide sched_clock_noinstr() loongarch: Provide noinstr sched_clock_read() ... |
||
Linus Torvalds
|
8d7071af89 |
mm: always expand the stack with the mmap write lock held
This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again. For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that. It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures. As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended. Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64 Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
64bf6ae93e |
v6.5/vfs.misc
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4SwAKCRCRxhvAZXjc ojOTAP9gT/z1gasIf8OwDHb4inZGnVpHh2ApKLvgMXH6ICtwRgD+OBtOcf438Lx1 cpFSTVJlh21QXMOOXWHe/LRUV2kZ5wI= =zdfx -----END PGP SIGNATURE----- Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Miscellaneous features, cleanups, and fixes for vfs and individual fs Features: - Use mode 0600 for file created by cachefilesd so it can be run by unprivileged users. This aligns them with directories which are already created with mode 0700 by cachefilesd - Reorder a few members in struct file to prevent some false sharing scenarios - Indicate that an eventfd is used a semaphore in the eventfd's fdinfo procfs file - Add a missing uapi header for eventfd exposing relevant uapi defines - Let the VFS protect transitions of a superblock from read-only to read-write in addition to the protection it already provides for transitions from read-write to read-only. Protecting read-only to read-write transitions allows filesystems such as ext4 to perform internal writes, keeping writers away until the transition is completed Cleanups: - Arnd removed the architecture specific arch_report_meminfo() prototypes and added a generic one into procfs.h. Note, we got a report about a warning in amdpgpu codepaths that suggested this was bisectable to this change but we concluded it was a false positive - Remove unused parameters from split_fs_names() - Rename put_and_unmap_page() to unmap_and_put_page() to let the name reflect the order of the cleanup operation that has to unmap before the actual put - Unexport buffer_check_dirty_writeback() as it is not used outside of block device aops - Stop allocating aio rings from highmem - Protecting read-{only,write} transitions in the VFS used open-coded barriers in various places. Replace them with proper little helpers and document both the helpers and all barrier interactions involved when transitioning between read-{only,write} states - Use flexible array members in old readdir codepaths Fixes: - Use the correct type __poll_t for epoll and eventfd - Replace all deprecated strlcpy() invocations, whose return value isn't checked with an equivalent strscpy() call - Fix some kernel-doc warnings in fs/open.c - Reduce the stack usage in jffs2's xattr codepaths finally getting rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] royally annoying compilation warning - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not fmode_t is required to avoid fmode_t to integer degradation warnings - Create coredumps with O_WRONLY instead of O_RDWR. There's a long explanation in that commit how O_RDWR is actually a bug which we found out with the help of Linus and git archeology - Fix "no previous prototype" warnings in the pipe codepaths - Add overflow calculations for remap_verify_area() as a signed addition overflow could be triggered in xfstests - Fix a null pointer dereference in sysv - Use an unsigned variable for length calculations in jfs avoiding compilation warnings with gcc 13 - Fix a dangling pipe pointer in the watch queue codepath - The legacy mount option parser provided as a fallback by the VFS for filesystems not yet converted to the new mount api did prefix the generated mount option string with a leading ',' causing issues for some filesystems - Fix a repeated word in a comment in fs.h - autofs: Update the ctime when mtime is updated as mandated by POSIX" * tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) readdir: Replace one-element arrays with flexible-array members fs: Provide helpers for manipulating sb->s_readonly_remount fs: Protect reconfiguration of sb read-write from racing writes eventfd: add a uapi header for eventfd userspace APIs autofs: set ctime as well when mtime changes on a dir eventfd: show the EFD_SEMAPHORE flag in fdinfo fs/aio: Stop allocating aio rings from HIGHMEM fs: Fix comment typo fs: unexport buffer_check_dirty_writeback fs: avoid empty option when generating legacy mount string watch_queue: prevent dangling pipe pointer fs.h: Optimize file struct to prevent false sharing highmem: Rename put_and_unmap_page() to unmap_and_put_page() cachefiles: Allow the cache to be non-root init: remove unused names parameter in split_fs_names() jfs: Use unsigned variable for length calculations fs/sysv: Null check to prevent null-ptr-deref bug fs: use UB-safe check for signed addition overflow in remap_verify_area procfs: consolidate arch_report_meminfo declaration fs: pipe: reveal missing function protoypes ... |
||
Linus Torvalds
|
9d9a9bf07e |
s390 updates for 6.5
- Use correct type for size of memory allocated for ELF core header on kernel crash. - Fix insecure W+X mapping warning when KASAN shadow memory range is not aligned on page boundary. - Avoid allocation of short by one page KASAN shadow memory when the original memory range is less than (PAGE_SIZE << 3). - Fix virtual vs physical address confusion in physical memory enumerator. It is not a real issue, since virtual and physical addresses are currently the same. - Set CONFIG_NET_TC_SKB_EXT=y in s390 config files as it is required for offloading TC as well as bridges on switchdev capable ConnectX devices. -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZJWgRRccYWdvcmRlZXZA bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8LuPAQDaL4mjPxKvleuiTeH1hf4id48X i5UZSf6mAirKwyo4WAEA0sAIvhdJlC8+MBEC1QtYkIJyoxmSg6AsMH5MJL+61wA= =mJ2Q -----END PGP SIGNATURE----- Merge tag 's390-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Alexander Gordeev: - Use correct type for size of memory allocated for ELF core header on kernel crash. - Fix insecure W+X mapping warning when KASAN shadow memory range is not aligned on page boundary. - Avoid allocation of short by one page KASAN shadow memory when the original memory range is less than (PAGE_SIZE << 3). - Fix virtual vs physical address confusion in physical memory enumerator. It is not a real issue, since virtual and physical addresses are currently the same. - Set CONFIG_NET_TC_SKB_EXT=y in s390 config files as it is required for offloading TC as well as bridges on switchdev capable ConnectX devices. * tag 's390-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/defconfigs: set CONFIG_NET_TC_SKB_EXT=y s390/boot: fix physmem_info virtual vs physical address confusion s390/kasan: avoid short by one page shadow memory s390/kasan: fix insecure W+X mapping warning s390/crash: use the correct type for memory allocation |
||
Niklas Schnelle
|
ad3d770b83 |
s390/defconfigs: set CONFIG_NET_TC_SKB_EXT=y
As made explicit by commit
|
||
Thomas Richter
|
9b9cf3c77e |
s390/cpum_cf: rework PER_CPU_DEFINE of struct cpu_cf_events
Struct cpu_cf_events is a large data structure and is statically defined for each possible CPU. Rework this and replace it by dynamically allocated data structures created when a perf_event_open() system call is invoked or an access via character device /dev/hwctr takes place. It is replaced by an array of pointers to all possible CPUs and reference counting. The array of pointers is allocated when the first event is created. For each online CPU an event is installed on, a struct cpu_cf_events is allocated and a pointer to struct cpu_cf_events is stored in the array: CPU 0 1 2 3 ... N +---+---+---+---+---+---+ cpu_cf_root::cpucf--> | * | | | |...| | +-|-+---+---+---+---+---+ | | \|/ +-------------+ |cpu_cf_events| | | +-------------+ With this approach the large data structure is only allocated when an event is actually installed and used. Also implement proper reference counting for allocation and removal. During interrupt processing make sure the pointer to cpu_cf_events is valid. The interrupt handler is shared and might be called when no event is active. This requires checking for a valid pointer to struct cpu_cf_events. When the pointer to the per-cpu cpu_cf_events is NULL, simply return. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
d0d3e218d5 |
s390/cpum_cf: open access to hwctr device for CAP_PERFMON privileged process
The device /dev/hwctr was introduced to access complete CPU Measurement facility counter sets via an ioctl system call. The access the to device is limited to privileged processes running as root or superuser. The capability CAP_SYS_ADMIN is required. The device permissions are read/write for the device owner root. There is no need for this restriction. Make the device access permission read/write for all and reduce the capabilities to CAP_PERFMON. Any user space program with the CAP_PERFMON capability assigned to it can now read and display the CPU Measurement facility counter sets. For more details on perf tool usage and security, see linux documentation in Documentation/admin-guide/perf-security.rst. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Sumanth Korikkar
|
11458e2b3f |
s390/module: fix rela calculation for R_390_GOTENT
During module load, module layout allocation occurs by initially allowing the architecture to frob the sections. This is performed via module_frob_arch_sections(). However, the size of each module memory types like text,data,rodata etc are updated correctly only after layout_sections(). After calculation of required module memory sizes for each types, move_module() is responsible for allocating the module memory for each type from modules vaddr range. Considering the sequence above, module_frob_arch_sections() updates the module mod_arch_specific got_offset before module memory text type size is fully updated in layout_sections(). Hence mod_arch_specific got_offset points to currently zero. As per s390 ABI, R_390_GOTENT : (G + O + A - P) >> 1 where G=me->mem[MOD_TEXT].base+me->arch.got_offset O=info->got_offset A=rela->r_addend P=loc fix R_390_GOTENT calculation in apply_rela(). Note: currently this doesn't break anything because me->arch.got_offset is zero. However, reordering of functions in the future could break it. Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
c70505434c |
s390/boot: fix physmem_info virtual vs physical address confusion
Fix virtual vs physical address confusion (which currently are the same). Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
3e8261003b |
s390/kasan: avoid short by one page shadow memory
Kernel Address Sanitizer uses 3 bits per byte to encode memory. That is the number of bits the start and end address of a memory range is shifted right when the corresponding shadow memory is created for that memory range. The used memory mapping routine expects page-aligned addresses, while the above described 3-bit shift might turn the shadow memory range start and end boundaries into non-page-aligned in case the size of the original memory range is less than (PAGE_SIZE << 3). As result, the resulting shadow memory range could be short on one page. Align on page boundary the start and end addresses when mapping a shadow memory range and avoid the described issue in the future. Note, that does not fix a real problem, since currently no virtual regions of size less than (PAGE_SIZE << 3) exist. Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
2ed8b50975 |
s390/kasan: fix insecure W+X mapping warning
Since commit 3b5c3f000c2e ("s390/kasan: move shadow mapping
to decompressor") the decompressor establishes mappings for
the shadow memory and sets initial protection attributes to
RWX. The decompressed kernel resets protection to RW+NX
later on.
In case a shadow memory range is not aligned on page boundary
(e.g. as result of mem= kernel command line parameter use),
the "Checked W+X mappings: FAILED, 1 W+X pages found" warning
hits.
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Fixes:
|
||
Christophe JAILLET
|
f471c6585c |
s390/crash: use the correct type for memory allocation
get_elfcorehdr_size() returns a size_t, so there is no real point to store it in a u32. Turn 'alloc_size' into a size_t. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/0756118c9058338f3040edb91971d0bfd100027b.1686688212.git.christophe.jaillet@wanadoo.fr Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Hugh Dickins
|
b2f58941ad |
s390: gmap use pte_unmap_unlock() not spin_unlock()
pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(), and use pte_unmap_unlock() instead of direct spin_unlock() (even though ptep ends up unused inside the macro). Link: https://lkml.kernel.org/r/78873af-e1ec-4f9-47ac-483940ac6daa@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: John David Anglin <dave.anglin@bell.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
5c7f3bf04a |
s390: allow pte_offset_map_lock() to fail
In rare transient cases, not yet made possible, pte_offset_map() and pte_offset_map_lock() may not find a page table: handle appropriately. Add comment on mm's contract with s390 above __zap_zero_pages(), and fix old comment there: must be called after THP was disabled. Link: https://lkml.kernel.org/r/3ff29363-336a-9733-12a1-5c31a45c8aeb@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: John David Anglin <dave.anglin@bell.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Steffen Eiden
|
db54dfc9f7 |
s390/uv: Update query for secret-UVCs
Update the query struct such that secret-UVC related information can be parsed. Add sysfs files for these new values. 'supp_add_secret_req_ver' notes the supported versions for the Add Secret UVC. Bit 0 indicates that version 0x100 is supported, bit 1 indicates 0x200, and so on. 'supp_add_secret_pcf' notes the supported plaintext flags for the Add Secret UVC. 'supp_secret_types' notes the supported types of secrets. Bit 0 indicates secret type 1, bit 1 indicates type 2, and so on. 'max_secrets' notes the maximum amount of secrets the secret store can store per pv guest. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230615100533.3996107-8-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20230615100533.3996107-8-seiden@linux.ibm.com> |
||
Steffen Eiden
|
78d3326e72 |
s390/uv: replace scnprintf with sysfs_emit
Replace scnprintf(page, PAGE_SIZE, ...) with the page size aware sysfs_emit(buf, ...) which adds some sanity checks. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230615100533.3996107-7-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20230615100533.3996107-7-seiden@linux.ibm.com> |
||
Steffen Eiden
|
2d8a26acaf |
s390/uvdevice: Add 'Lock Secret Store' UVC
Userspace can call the Lock Secret Store Ultravisor Call using IOCTLs on the uvdevice. The Lock Secret Store UV call disables all additions of secrets for the future. The uvdevice is merely transporting the request from userspace to the Ultravisor. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230615100533.3996107-6-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20230615100533.3996107-6-seiden@linux.ibm.com> |
||
Steffen Eiden
|
b96b3ce272 |
s390/uvdevice: Add 'List Secrets' UVC
Userspace can call the List Secrets Ultravisor Call using IOCTLs on the uvdevice. The List Secrets UV call lists the identifier of the secrets in the UV secret store. The uvdevice is merely transporting the request from userspace to Ultravisor. It's neither checking nor manipulating the request or response data. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230615100533.3996107-5-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20230615100533.3996107-5-seiden@linux.ibm.com> |
||
Steffen Eiden
|
44567ca21a |
s390/uvdevice: Add 'Add Secret' UVC
Userspace can call the Add Secret Ultravisor Call using IOCTLs on the uvdevice. The Add Secret UV call sends an encrypted and cryptographically verified request to the Ultravisor. The request inserts a protected guest's secret into the Ultravisor for later use. The uvdevice is merely transporting the request from userspace to the Ultravisor. It's neither checking nor manipulating the request data. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230615100533.3996107-4-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20230615100533.3996107-4-seiden@linux.ibm.com> |
||
Steffen Eiden
|
ea9d971635 |
s390/uvdevice: Add info IOCTL
Add an IOCTL that allows userspace to find out which IOCTLs the uvdevice supports without trial and error. Explicitly expose the IOCTL nr for the request types. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230615100533.3996107-3-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20230615100533.3996107-3-seiden@linux.ibm.com> |
||
Steffen Eiden
|
4255ce0177 |
s390/uv: Always export uv_info
KVM needs the struct's values to be able to provide PV support. The uvdevice is currently guest only and will need the struct's values for call support checking and potential future expansions. As uv.c is only compiled with CONFIG_PGSTE or CONFIG_PROTECTED_VIRTUALIZATION_GUEST we don't need a second check in the code. Users of uv_info will need to fence for these two config options for the time being. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230615100533.3996107-2-seiden@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20230615100533.3996107-2-seiden@linux.ibm.com> |
||
Christian Borntraeger
|
0bc380beb7 |
KVM: s390/diag: fix racy access of physical cpu number in diag 9c handler
We do check for target CPU == -1, but this might change at the time we
are going to use it. Hold the physical target CPU in a local variable to
avoid out-of-bound accesses to the cpu arrays.
Cc: Pierre Morel <pmorel@linux.ibm.com>
Fixes:
|
||
Pierre Morel
|
246be7d272 |
KVM: s390: vsie: fix the length of APCB bitmap
bit_and() uses the count of bits as the woking length. Fix the previous implementation and effectively use the right bitmap size. Fixes: |
||
Nico Boehr
|
285cff4c04 |
KVM: s390: fix KVM_S390_GET_CMMA_BITS for GFNs in memslot holes
The KVM_S390_GET_CMMA_BITS ioctl may return incorrect values when userspace
specifies a start_gfn outside of memslots.
This can occur when a VM has multiple memslots with a hole in between:
+-----+----------+--------+--------+
| ... | Slot N-1 | <hole> | Slot N |
+-----+----------+--------+--------+
^ ^ ^ ^
| | | |
GFN A A+B | |
A+B+C |
A+B+C+D
When userspace specifies a GFN in [A+B, A+B+C), it would expect to get the
CMMA values of the first dirty page in Slot N. However, userspace may get a
start_gfn of A+B+C+D with a count of 0, hence completely skipping over any
dirty pages in slot N.
The error is in kvm_s390_next_dirty_cmma(), which assumes
gfn_to_memslot_approx() will return the memslot _below_ the specified GFN
when the specified GFN lies outside a memslot. In reality it may return
either the memslot below or above the specified GFN.
When a memslot above the specified GFN is returned this happens:
- ofs is calculated, but since the memslot's base_gfn is larger than the
specified cur_gfn, ofs will underflow to a huge number.
- ofs is passed to find_next_bit(). Since ofs will exceed the memslot's
number of pages, the number of pages in the memslot is returned,
completely skipping over all bits in the memslot userspace would be
interested in.
Fix this by resetting ofs to zero when a memslot _above_ cur_gfn is
returned (cur_gfn < ms->base_gfn).
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes:
|
||
Arnd Bergmann
|
af0a76e126 |
thread_info: move function declarations to linux/thread_info.h
There are a few __weak functions in kernel/fork.c, which architectures can override. If there is no prototype, the compiler warns about them: kernel/fork.c:164:13: error: no previous prototype for 'arch_release_task_struct' [-Werror=missing-prototypes] kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes] kernel/fork.c:1086:12: error: no previous prototype for 'arch_dup_task_struct' [-Werror=missing-prototypes] There are already prototypes in a number of architecture specific headers that have addressed those warnings before, but it's much better to have these in a single place so the warning no longer shows up anywhere. Link: https://lkml.kernel.org/r/20230517131102.934196-14-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Arnd Bergmann
|
ad1a48301f |
init: consolidate prototypes in linux/init.h
The init/main.c file contains some extern declarations for functions defined in architecture code, and it defines some other functions that are called from architecture code with a custom prototype. Both of those result in warnings with 'make W=1': init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] init/main.c:790:20: error: no previous prototype for 'mem_encrypt_init' [-Werror=missing-prototypes] init/main.c:792:20: error: no previous prototype for 'poking_init' [-Werror=missing-prototypes] arch/arm64/kernel/irq.c:122:13: error: no previous prototype for 'init_IRQ' [-Werror=missing-prototypes] arch/arm64/kernel/time.c:55:13: error: no previous prototype for 'time_init' [-Werror=missing-prototypes] arch/x86/kernel/process.c:935:13: error: no previous prototype for 'arch_post_acpi_subsys_init' [-Werror=missing-prototypes] init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes] Add prototypes for all of these in include/linux/init.h or another appropriate header, and remove the duplicate declarations from architecture specific code. [sfr@canb.auug.org.au: declare time_init_early()] Link: https://lkml.kernel.org/r/20230519124311.5167221c@canb.auug.org.au Link: https://lkml.kernel.org/r/20230517131102.934196-12-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
ca5e863233 |
mm/gup: remove vmas parameter from get_user_pages_remote()
The only instances of get_user_pages_remote() invocations which used the vmas parameter were for a single page which can instead simply look up the VMA directly. In particular:- - __update_ref_ctr() looked up the VMA but did nothing with it so we simply remove it. - __access_remote_vm() was already using vma_lookup() when the original lookup failed so by doing the lookup directly this also de-duplicates the code. We are able to perform these VMA operations as we already hold the mmap_lock in order to be able to call get_user_pages_remote(). As part of this work we add get_user_page_vma_remote() which abstracts the VMA lookup, error handling and decrementing the page reference count should the VMA lookup fail. This forms part of a broader set of patches intended to eliminate the vmas parameter altogether. [akpm@linux-foundation.org: avoid passing NULL to PTR_ERR] Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64) Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390) Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Nhat Pham
|
946e697c69 |
cachestat: wire up cachestat for other architectures
cachestat is previously only wired in for x86 (and architectures using the generic unistd.h table): https://lore.kernel.org/lkml/20230503013608.2431726-1-nphamcs@gmail.com/ This patch wires cachestat in for all the other architectures. [nphamcs@gmail.com: wire up cachestat for arm64] Link: https://lkml.kernel.org/r/20230511092843.3896327-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230510195806.2902878-1-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
87aceaa7f0 |
s390 updates for 6.4-rc6
- Avoid linker error for randomly generated config file that has CONFIG_BRANCH_PROFILE_NONE enabled and make it similar to riscv, x86 and also to commit |
||
Alexander Gordeev
|
e23b4fdb5c |
Merge branch 'protected-key' into features
Harald Freudenberger says: =================== This patches do some cleanup and reorg of the pkey module code and extend the existing ioctl with supporting derivation of protected key material from clear key material for some ECC curves with the help of the PCKMO instruction. Please note that 'protected key' is a special type of key only available on s390. It is similar to an secure key which is encrypted by a master key sitting inside an HSM. In contrast to secure keys a protected key is encrypted by a random key located in a hidden firmware memory accessible by the CPU and thus much faster but less secure. =================== The merged updates are: - Fix the style of protected key API driver source: use x-mas tree for all local variable declarations. - Rework protected key API driver to not use the struct pkey_protkey and pkey_clrkey anymore. Both structures have a fixed size buffer, but with the support of ECC protected key these buffers are not big enough. Use dynamic buffers internally and transparently for userspace. - Add support for a new 'non CCA clear key token' with ECC clear keys supported: ECC P256, ECC P384, ECC P521, ECC ED25519 and ECC ED448. This makes it possible to derive a protected key from the ECC clear key input via PKEY_KBLOB2PROTK3 ioctl, while currently the only way to derive is via PCKMO instruction. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Peter Zijlstra
|
91b41a2375 |
s390/time: Provide sched_clock_noinstr()
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_noinstr() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.570170436@infradead.org |
||
Peter Zijlstra
|
497cc42bf5 |
s390/cpum_sf: Convert to cmpxchg128()
Now that there is a cross arch u128 and cmpxchg128(), use those instead of the custom CDSG helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132324.058821078@infradead.org |
||
Peter Zijlstra
|
febe950dbf |
arch: Remove cmpxchg_double
No moar users, remove the monster. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.991907085@infradead.org |
||
Peter Zijlstra
|
6d12c8d308 |
percpu: Wire up cmpxchg128
In order to replace cmpxchg_double() with the newly minted cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.654945124@infradead.org |
||
Peter Zijlstra
|
b23e139d0b |
arch: Introduce arch_{,try_}_cmpxchg128{,_local}()
For all architectures that currently support cmpxchg_double() implement the cmpxchg128() family of functions that is basically the same but with a saner interface. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.452120708@infradead.org |
||
Kees Cook
|
4ce1e94175 |
s390/purgatory: Do not use fortified string functions
With the addition of -fstrict-flex-arrays=3, struct sha256_state's
trailing array is no longer ignored by CONFIG_FORTIFY_SOURCE:
struct sha256_state {
u32 state[SHA256_DIGEST_SIZE / 4];
u64 count;
u8 buf[SHA256_BLOCK_SIZE];
};
This means that the memcpy() calls with "buf" as a destination in
sha256.c's code will attempt to perform run-time bounds checking, which
could lead to calling missing functions, specifically a potential
WARN_ONCE, which isn't callable from purgatory.
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Closes: https://lore.kernel.org/lkml/175578ec-9dec-7a9c-8d3a-43f24ff86b92@leemhuis.info/
Bisected-by: "Joan Bruguera Micó" <joanbrugueram@gmail.com>
Fixes:
|
||
Harald Freudenberger
|
9e436c195e |
s390/pkey: add support for ecc clear key
Add support for a new 'non CCA clear key token' with these ECC clear keys supported: - ECC P256 - ECC P384 - ECC P521 - ECC ED25519 - ECC ED448 This makes it possible to derive a protected key from this ECC clear key input via PKEY_KBLOB2PROTK3 ioctl. As of now the only way to derive protected keys from these clear key tokens is via PCKMO instruction. For AES keys an alternate path via creating a secure key from the clear key and then derive a protected key from the secure key exists. This alternate path is not implemented for ECC keys as it would require to rearrange and maybe recalculate the clear key material for input to derive an CCA or EP11 ECC secure key. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Harald Freudenberger
|
f370f45c64 |
s390/pkey: do not use struct pkey_protkey
This is an internal rework of the pkey code to not use the struct pkey_protkey internal any more. This struct has a hard coded protected key buffer with MAXPROTKEYSIZE = 64 bytes. However, with support for ECC protected key, this limit is too short and thus this patch reworks all the internal code to use the triple u8 *protkey, u32 protkeylen, u32 protkeytype instead. So the ioctl which still has to deal with this struct coming from userspace and/or provided to userspace invoke all the internal functions now with the triple instead of passing a pointer to struct pkey_protkey. Also the struct pkey_clrkey has been internally replaced in a similar way. This struct also has a hard coded clear key buffer of MAXCLRKEYSIZE = 32 bytes and thus is not usable with e.g. ECC clear key material. This is a transparent rework for userspace applications using the pkey API. The internal kernel API used by the PAES crypto ciphers has been adapted to this change to make it possible to provide ECC protected keys via this interface in the future. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Mikhail Zaslonko
|
31e9ccc67c |
s390/ipl: add REIPL_CLEAR flag to os_info
Introduce new OS_INFO_FLAGS_ENTRY to os_info pointing to the field with bit flags. Add OS_INFO_FLAGS_ENTRY upon dump_reipl shutdown action processing and set OS_INFO_FLAG_REIPL_CLEAR flag indicating 'clear' sysfs attribute has been set on the panicked system for specified ipl type. This flag can be used to inform the dumper whether LOAD_CLEAR or LOAD_NORMAL diag308 subcode to be used for ipl after dumping the memory. Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
03c5c83b70 |
s390/purgatory: disable branch profiling
Avoid linker error for randomly generated config file that
has CONFIG_BRANCH_PROFILE_NONE enabled and make it similar
to riscv, x86 and also to commit
|
||
Linus Torvalds
|
ac92c27935 |
s390 updates for 6.4-rc3
- Add check whether the required facilities are installed before using the s390-specific ChaCha20 implementation. - Key blobs for s390 protected key interface IOCTLs commands PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key material. Zeroize copies of these keys in kernel memory after creating protected keys. - Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra overhead of initializing all stack variables by default. - Make sure that when a new channel-path is enabled all subchannels are evaluated: with and without any devices connected on it. - When SMT thread CPUs are added to CPU topology masks the nr_cpu_ids limit is not checked and could be exceeded. Respect the nr_cpu_ids limit and avoid a warning when CONFIG_DEBUG_PER_CPU_MAPS is set. - The pointer to IPL Parameter Information Block is stored in the absolute lowcore as a virtual address. Save it as the physical address for later use by dump tools. - Fix a Queued Direct I/O (QDIO) problem on z/VM guests using QIOASSIST with dedicated (pass through) QDIO-based devices such as FCP, real OSA or HiperSockets. - s390's struct statfs and struct statfs64 contain padding, which field-by-field copying does not set. Initialize the respective structures with zeros before filling them and copying to userspace. - Grow s390 compat_statfs64, statfs and statfs64 structures f_spare array member to cover padding and simplify things. - Remove obsolete SCHED_BOOK and SCHED_DRAWER configs. - Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs. -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZGd5BRccYWdvcmRlZXZA bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8OqMAQCsdBG7eR3dp3mY8ao34dqlWt98 rDQD8oiMgCkFyn77jQEAoo3HhqWY8oTu88fl82dkF0OpGW+7zgoNHUYhH8Z0gAY= =wtTO -----END PGP SIGNATURE----- Merge tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Alexander Gordeev: - Add check whether the required facilities are installed before using the s390-specific ChaCha20 implementation - Key blobs for s390 protected key interface IOCTLs commands PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key material. Zeroize copies of these keys in kernel memory after creating protected keys - Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra overhead of initializing all stack variables by default - Make sure that when a new channel-path is enabled all subchannels are evaluated: with and without any devices connected on it - When SMT thread CPUs are added to CPU topology masks the nr_cpu_ids limit is not checked and could be exceeded. Respect the nr_cpu_ids limit and avoid a warning when CONFIG_DEBUG_PER_CPU_MAPS is set - The pointer to IPL Parameter Information Block is stored in the absolute lowcore as a virtual address. Save it as the physical address for later use by dump tools - Fix a Queued Direct I/O (QDIO) problem on z/VM guests using QIOASSIST with dedicated (pass through) QDIO-based devices such as FCP, real OSA or HiperSockets - s390's struct statfs and struct statfs64 contain padding, which field-by-field copying does not set. Initialize the respective structures with zeros before filling them and copying to userspace - Grow s390 compat_statfs64, statfs and statfs64 structures f_spare array member to cover padding and simplify things - Remove obsolete SCHED_BOOK and SCHED_DRAWER configs - Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs * tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU s390/Kconfig: remove obsolete configs SCHED_{BOOK,DRAWER} s390/uapi: cover statfs padding by growing f_spare statfs: enforce statfs[64] structure initialization s390/qdio: fix do_sqbs() inline assembly constraint s390/ipl: fix IPIB virtual vs physical address confusion s390/topology: honour nr_cpu_ids when adding CPUs s390/cio: include subchannels without devices also for evaluation s390/defconfigs: set CONFIG_INIT_STACK_NONE=y s390/pkey: zeroize key blobs s390/crypto: use vector instructions only if available for ChaCha20 |
||
Ze Gao
|
571a2a50a8 |
rethook, fprobe: do not trace rethook related functions
These functions are already marked as NOKPROBE to prevent recursion and
we have the same reason to blacklist them if rethook is used with fprobe,
since they are beyond the recursion-free region ftrace can guard.
Link: https://lore.kernel.org/all/20230517034510.15639-5-zegao@tencent.com/
Fixes:
|
||
Jason Gunthorpe
|
0f1cbf941d |
s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU
These don't do anything anymore, the only user of the symbol was VFIO_CCW/AP which already "depends on VFIO" and VFIO itself selects IOMMU_API. When this was added VFIO was wrongly doing "depends on IOMMU_API" which required some contortions like this to ensure IOMMU_API was turned on. Reviewed-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/0-v2-eb322ce2e547+188f-rm_iommu_ccw_jgg@nvidia.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Lukas Bulwahn
|
e534167cee |
s390/Kconfig: remove obsolete configs SCHED_{BOOK,DRAWER}
Commit
|
||
Ilya Leoshkevich
|
6d9406f80c |
s390/uapi: cover statfs padding by growing f_spare
pahole says: struct compat_statfs64 { ... u32 f_spare[4]; /* 68 16 */ /* size: 88, cachelines: 1, members: 12 */ /* padding: 4 */ struct statfs { ... unsigned int f_spare[4]; /* 68 16 */ /* size: 88, cachelines: 1, members: 12 */ /* padding: 4 */ struct statfs64 { ... unsigned int f_spare[4]; /* 68 16 */ /* size: 88, cachelines: 1, members: 12 */ /* padding: 4 */ One has to keep the existence of padding in mind when working with these structs. Grow f_spare arrays to 5 in order to simplify things. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230504144021.808932-3-iii@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Arnd Bergmann
|
ef104443bf |
procfs: consolidate arch_report_meminfo declaration
The arch_report_meminfo() function is provided by four architectures, with a __weak fallback in procfs itself. On architectures that don't have a custom version, the __weak version causes a warning because of the missing prototype. Remove the architecture specific prototypes and instead add one in linux/proc_fs.h. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for arch/x86 Acked-by: Helge Deller <deller@gmx.de> # parisc Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Message-Id: <20230516195834.551901-1-arnd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
Alexander Gordeev
|
2facd5d398 |
s390/ipl: fix IPIB virtual vs physical address confusion
The pointer to IPL Parameter Information Block is stored in the absolute lowcore for later use by dump tools. That pointer is a virtual address, though it should be physical instead. Note, this does not fix a real issue, since virtual and physical addresses are currently the same. Suggested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
a33239be2d |
s390/topology: honour nr_cpu_ids when adding CPUs
When SMT thread CPUs are added to CPU masks the nr_cpu_ids limit is not checked and could be exceeded. This leads to a warning for example if CONFIG_DEBUG_PER_CPU_MAPS is set and the command line parameter nr_cpus is set to 1. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
124acbe275 |
s390/defconfigs: set CONFIG_INIT_STACK_NONE=y
Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid the extra overhead of initializing all stack variables by default. Users who want to have that must change the configuration on their own. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
8703dd6b23 |
s390/crypto: use vector instructions only if available for ChaCha20
Commit |
||
Heiko Carstens
|
fbac266f09 |
s390: select ARCH_SUPPORTS_INT128
s390 has instructions to support 128 bit arithmetics, e.g. a 64 bit multiply instruction with a 128 bit result. Also 128 bit integer artithmetics are already used in s390 specific architecture code (see e.g. read_persistent_clock64()). Therefore select ARCH_SUPPORTS_INT128. However limit this to clang for now, since gcc generates inefficient code, which may lead to stack overflows, when compiling lib/crypto/curve25519-hacl64.c which depends on ARCH_SUPPORTS_INT128. The gcc generated functions have 6kb stack frames, compared to only 1kb of the code generated with clang. If the kernel is compiled with -Os library calls for __ashlti3(), __ashrti3(), and __lshrti3() may be generated. Similar to arm64 and riscv provide assembler implementations for these functions. Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
1f2597cd36 |
s390/pai_ext: replace atomic_t with refcount_t
The s390 PMU of PAI extension 1 NNPA counters uses atomic_t for reference counting. Replace this with the proper data type refcount_t. No functional change. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
ecc758cee6 |
s390/pai_crypto: replace atomic_t with refcount_t
The s390 PMU of PAI crypto counters uses atomic_t for reference counting. Replace this with the proper data type refcount_t. No functional change. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Lukas Bulwahn
|
c12753d5fa |
s390: remove the unneeded select GCC12_NO_ARRAY_BOUNDS
Commit
|
||
Linus Torvalds
|
b115d85a95 |
Locking changes in v6.4:
- Introduce local{,64}_try_cmpxchg() - a slightly more optimal primitive, which will be used in perf events ring-buffer code. - Simplify/modify rwsems on PREEMPT_RT, to address writer starvation. - Misc cleanups/fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRUvUoRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hlIhAArP33rTKi+HAndQ3UHW3XtmHRxEEQTfiE wvIoN89h58QW4DGMeAV4ltafbIPQAkI233Aogwz903L0qbDV0Ro4OU3XJembRuWl LeOADKwYyypXdOa8XICuY9aIP7e1/h0DF3ySs7inLcwK9JCyAIxnsVHYej+hsRXA kZoXN98T3TR1C0V9UQy4SU3HI1lC3tsG3R9Ti9TnYUg3ygVXhRE9lOQ4kv9lFPVz BNuj2Blj7KNiVaY9kehrhO54THI7NmsCVZO44Rcl48I0KAcFulAmFcNlE7GnR8Nj thj38pU6XAFVHXG8MYjgE+Al+PnK48NtJxexCtHyGvGG4D2aLzRMnkolxAUCcVuK G+UBsQm3ybjYgHgt1zuN6ehcpT+5tULkDH8JA7vrgZYaVgxHzsUaHgYfCCWKnmUY mPR6aImEmYZwZVNLskhe0HT4mq244bp+VnWlnJ6LZK7t/itenvDhqnj7KTi4Bfej lTHplOTitV/8uCEW8V4pX+YTEenVsIQmTc/G3iIabXP/6HzLffA3q4vyW6vKIErE pqrpuFA0Z4GB+pU0mJXt7+I7zscDVthwI055jDyQBjA7IcdVGm2MjQ6xcNRW5FYN UynvaEMocue4ZO4WdFsd1ZBUd9VfoNzGQspBw46DhCL1MEQBYv36SKQNjej/9aRr ilVwqnOWI2s= =mM0A -----END PGP SIGNATURE----- Merge tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Introduce local{,64}_try_cmpxchg() - a slightly more optimal primitive, which will be used in perf events ring-buffer code - Simplify/modify rwsems on PREEMPT_RT, to address writer starvation - Misc cleanups/fixes * tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/atomic: Correct (cmp)xchg() instrumentation locking/x86: Define arch_try_cmpxchg_local() locking/arch: Wire up local_try_cmpxchg() locking/generic: Wire up local{,64}_try_cmpxchg() locking/atomic: Add generic try_cmpxchg{,64}_local() support locking/rwbase: Mitigate indefinite writer starvation locking/arch: Rename all internal __xchg() names to __arch_xchg() |
||
Linus Torvalds
|
493804a689 |
RISC-V:
- ONE_REG interface to enable/disable SBI extensions - Zbb extension for Guest/VM - AIA CSR virtualization x86: - Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU can result in the root being freed even though the root is completely valid and can be reused as-is (with a TLB flush). s390: - A couple bugfixes. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRU198UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOeiwf+OdIcGxOlt9D/qNys/U+2kkhZK6WU 1Jrwn+S5onea2LGyBucHk3V6O1rWck8aOqfRAmFBQc5s/gu8EySW5Ic17kqgqLwc MSr2c70qw3/6U2LoTeVHx21V4Jgf9iPOZu1LAXf26F0jSCAoWlACO9OQd0rJoDmA cpUtHSAFKxuPA/8Ix+SH4ddzKhCYEs4QejMLmQIgBWyaGShdi8dvaMoKyoceHN2w ntrhSaz+BbCCZKaFwtNCHVaigyioqmwTHjsmMFjzEN3N1ly1XZfN2E7vvT4j3v7l o55eMpJnu7+TVs5LEMvj25nuxgts6UzJnoALJAO2cDsHW5F3FvjO9HGjTg== =NHEm -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull more kvm updates from Paolo Bonzini: "This includes the 6.4 changes for RISC-V, and a few bugfix patches for other architectures. For x86, this closes a longstanding performance issue in the newer and (usually) more scalable page table management code. RISC-V: - ONE_REG interface to enable/disable SBI extensions - Zbb extension for Guest/VM - AIA CSR virtualization x86: - Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU can result in the root being freed even though the root is completely valid and can be reused as-is (with a TLB flush). s390: - A couple of bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: s390: fix race in gmap_make_secure() KVM: s390: pv: fix asynchronous teardown for small VMs KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated RISC-V: KVM: Virtualize per-HART AIA CSRs RISC-V: KVM: Use bitmap for irqs_pending and irqs_pending_mask RISC-V: KVM: Add ONE_REG interface for AIA CSRs RISC-V: KVM: Implement subtype for CSR ONE_REG interface RISC-V: KVM: Initial skeletal support for AIA RISC-V: KVM: Drop the _MASK suffix from hgatp.VMID mask defines RISC-V: Detect AIA CSRs from ISA string RISC-V: Add AIA related CSR defines RISC-V: KVM: Allow Zbb extension for Guest/VM RISC-V: KVM: Add ONE_REG interface to enable/disable SBI extensions RISC-V: KVM: Alphabetize selects KVM: RISC-V: Retry fault if vma_lookup() results become invalid |
||
Paolo Bonzini
|
7a8016d956 |
For 6.4
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEoWuZBM6M3lCBSfTnuARItAMU6BMFAmRT4cIACgkQuARItAMU 6BPZ6xAA0T0AhPUuc4JyKH4p2ZFaH61+1YpGHlxxCyIF1gm4JMHS0i1CRii+o2Zh k3w76MafGau9ADXwrjJAh+C+WihwwgTtSyN2/amOn70K75iK7GGbT125c6VX6A1I 6L6lFEO/3JURrhQE4TthgH6HUVyXBC4XobQqajWtxGgcq/QRkjtgdbLPPv9HuH23 NgXZ85+lpPrak9U0Zzu2ez1O9VIlABWbTE6B6DgcyQshwqBZDOSFRTxxJ883XtqH 7sRMmVJBkiF3HRqvDAmxI0eUOrR8YNtl1PN441iadmvfqG4IDbPtuJ0UtiyNagXT g7UYGvv7Qj2z4y2QhQSgLiIC0Zrl/T6Vw/7oXt0CVybhEMuAAeshWzHIl5aR5aKt 7sY2ijUX290zwuACft9ZjDspUWyOPkvqym5UrldTj1ZaYV9w7iHNezFbXTnXOBh9 vEiet5p3SCeQNymFKXy2R+8YJ6IIsH3Mxf6SYo2CQFNNjOzfUHwKMix/uArFdXF5 CQD6jM9NXTiPIjwpXIJjyutJ6USWZhsMCneHtItzfWTa1YnBSZTwIWU1fB05Snk/ BcDVfJ9Pq142N983SNXaBYIGzxpYPUHYZQmKvAzbeLHcaxrgLJM9/x8zcXIxc8Wa sEFoo5W8DCIJRqjk7dsLyLEvPJdoBZidg6sd0MwHRGX0kDRzHEo= =PPnV -----END PGP SIGNATURE----- Merge tag 'kvm-s390-next-6.4-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD For 6.4 |
||
Linus Torvalds
|
15fb96a35d |
- Some DAMON cleanups from Kefeng Wang
- Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE ioctl's behavior more similar to KSM's behavior. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFLsxAAKCRDdBJ7gKXxA jl8yAQCqjstPsOULf9QN0z4bGAUhY+Wj4ERz1jbKSIuhFCJWiQEAgQvgRXObKjmi OtUB0Ek4CMDCQzbyIQ1Bhp3kxi6+Jgs= =AbyC -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - Some DAMON cleanups from Kefeng Wang - Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE ioctl's behavior more similar to KSM's behavior. [ Andrew called these "final", but I suspect we'll have a series fixing up the fact that the last commit in the dmapools series in the previous pull seems to have unintentionally just reverted all the other commits in the same series.. - Linus ] * tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: hwpoison: coredump: support recovery from dump_user_range() mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page() mm/ksm: move disabling KSM from s390/gmap code to KSM code selftests/ksm: ksm_functional_tests: add prctl unmerge test mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0 mm/damon/paddr: fix missing folio_sz update in damon_pa_young() mm/damon/paddr: minor refactor of damon_pa_mark_accessed_or_deactivate() mm/damon/paddr: minor refactor of damon_pa_pageout() |
||
Claudio Imbrenda
|
c148dc8e2f |
KVM: s390: fix race in gmap_make_secure()
Fix a potential race in gmap_make_secure() and remove the last user of
follow_page() without FOLL_GET.
The old code is locking something it doesn't have a reference to, and
as explained by Jason and David in this discussion:
https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/
it can lead to all kind of bad things, including the page getting
unmapped (MADV_DONTNEED), freed, reallocated as a larger folio and the
unlock_page() would target the wrong bit.
There is also another race with the FOLL_WRITE, which could race
between the follow_page() and the get_locked_pte().
The main point is to remove the last use of follow_page() without
FOLL_GET or FOLL_PIN, removing the races can be considered a nice
bonus.
Link: https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Fixes:
|
||
Claudio Imbrenda
|
292a7d6fca |
KVM: s390: pv: fix asynchronous teardown for small VMs
On machines without the Destroy Secure Configuration Fast UVC, the
topmost level of page tables is set aside and freed asynchronously
as last step of the asynchronous teardown.
Each gmap has a host_to_guest radix tree mapping host (userspace)
addresses (with 1M granularity) to gmap segment table entries (pmds).
If a guest is smaller than 2GB, the topmost level of page tables is the
segment table (i.e. there are only 2 levels). Replacing it means that
the pointers in the host_to_guest mapping would become stale and cause
all kinds of nasty issues.
This patch fixes the issue by disallowing asynchronous teardown for
guests with only 2 levels of page tables. Userspace should (and already
does) try using the normal destroy if the asynchronous one fails.
Update s390_replace_asce so it refuses to replace segment type ASCEs.
This is still needed in case the normal destroy VM fails.
Fixes:
|
||
David Hildenbrand
|
2c281f54f5 |
mm/ksm: move disabling KSM from s390/gmap code to KSM code
Let's factor out actual disabling of KSM. The existing "mm->def_flags &= ~VM_MERGEABLE;" was essentially a NOP and can be dropped, because def_flags should never include VM_MERGEABLE. Note that we don't currently prevent re-enabling KSM. This should now be faster in case KSM was never enabled, because we only conditionally iterate all VMAs. Further, it certainly looks cleaner. Link: https://lkml.kernel.org/r/20230422210156.33630-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Stefan Roesch <shr@devkernel.io> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
c8c655c34e |
s390:
* More phys_to_virt conversions * Improvement of AP management for VSIE (nested virtualization) ARM64: * Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. * New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. * Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. * A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. * The usual selftest fixes and improvements. KVM x86 changes for 6.4: * Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) * Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool * Move AMD_PSFD to cpufeatures.h and purge KVM's definition * Avoid unnecessary writes+flushes when the guest is only adding new PTEs * Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations * Clean up the range-based flushing APIs * Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry * Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() * Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware * Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features * Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES * Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest x86 AMD: * Add support for virtual NMIs * Fixes for edge cases related to virtual interrupts x86 Intel: * Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() * Fix a bug in emulation of ENCLS in compatibility mode * Allow emulation of NOP and PAUSE for L2 * AMX selftests improvements * Misc cleanups MIPS: * Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: * Drop unnecessary casts from "void *" throughout kvm_main.c * Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: * Fix goof introduced by the conversion to rST -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32 Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw== =S2Hq -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "s390: - More phys_to_virt conversions - Improvement of AP management for VSIE (nested virtualization) ARM64: - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. x86: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features - Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - AMD SVM: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts - Intel AMX: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - AMX selftests improvements - Misc cleanups MIPS: - Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: - Drop unnecessary casts from "void *" throughout kvm_main.c - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: - Fix goof introduced by the conversion to rST" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits) KVM: s390: pci: fix virtual-physical confusion on module unload/load KVM: s390: vsie: clarifications on setting the APCB KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init() KVM: selftests: Test the PMU event "Instructions retired" KVM: selftests: Copy full counter values from guest in PMU event filter test KVM: selftests: Use error codes to signal errors in PMU event filter test KVM: selftests: Print detailed info in PMU event filter asserts KVM: selftests: Add helpers for PMC asserts in PMU event filter test KVM: selftests: Add a common helper for the PMU event filter guest code KVM: selftests: Fix spelling mistake "perrmited" -> "permitted" KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: Handle 32bit CNTPCTSS traps KVM: arm64: nvhe: Synchronise with page table walker on vcpu run KVM: arm64: vgic: Don't acquire its_lock before config_lock KVM: selftests: Add test to verify KVM's supported XCR0 ... |
||
Linus Torvalds
|
10de638d8e |
s390 updates for the 6.4 merge window
- Add support for stackleak feature. Also allow specifying architecture-specific stackleak poison function to enable faster implementation. On s390, the mvc-based implementation helps decrease typical overhead from a factor of 3 to just 25% - Convert all assembler files to use SYM* style macros, deprecating the ENTRY() macro and other annotations. Select ARCH_USE_SYM_ANNOTATIONS - Improve KASLR to also randomize module and special amode31 code base load addresses - Rework decompressor memory tracking to support memory holes and improve error handling - Add support for protected virtualization AP binding - Add support for set_direct_map() calls - Implement set_memory_rox() and noexec module_alloc() - Remove obsolete overriding of mem*() functions for KASAN - Rework kexec/kdump to avoid using nodat_stack to call purgatory - Convert the rest of the s390 code to use flexible-array member instead of a zero-length array - Clean up uaccess inline asm - Enable ARCH_HAS_MEMBARRIER_SYNC_CORE - Convert to using CONFIG_FUNCTION_ALIGNMENT and enable DEBUG_FORCE_FUNCTION_ALIGN_64B - Resolve last_break in userspace fault reports - Simplify one-level sysctl registration - Clean up branch prediction handling - Rework CPU counter facility to retrieve available counter sets just once - Other various small fixes and improvements all over the code -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmRM8pwACgkQjYWKoQLX FBjV1AgAlvAhu1XkwOdwqdT4GqE8pcN4XXzydog1MYihrSO2PdgWAxpEW7o2QURN W+3xa6RIqt7nX2YBiwTanMZ12TYaFY7noGl3eUpD/NhueprweVirVl7VZUEuRoW/ j0mbx77xsVzLfuDFxkpVwE6/j+tTO78kLyjUHwcN9rFVUaL7/orJneDJf+V8fZG0 sHLOv0aljF7Jr2IIkw82lCmW/vdk7k0dACWMXK2kj1H3dIK34B9X4AdKDDf/WKXk /OSElBeZ93tSGEfNDRIda6iR52xocROaRnQAaDtargKFl9VO0/dN9ADxO+SLNHjN pFE/9VD6xT/xo4IuZZh/Z3TcYfiLvA== =Geqx -----END PGP SIGNATURE----- Merge tag 's390-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Add support for stackleak feature. Also allow specifying architecture-specific stackleak poison function to enable faster implementation. On s390, the mvc-based implementation helps decrease typical overhead from a factor of 3 to just 25% - Convert all assembler files to use SYM* style macros, deprecating the ENTRY() macro and other annotations. Select ARCH_USE_SYM_ANNOTATIONS - Improve KASLR to also randomize module and special amode31 code base load addresses - Rework decompressor memory tracking to support memory holes and improve error handling - Add support for protected virtualization AP binding - Add support for set_direct_map() calls - Implement set_memory_rox() and noexec module_alloc() - Remove obsolete overriding of mem*() functions for KASAN - Rework kexec/kdump to avoid using nodat_stack to call purgatory - Convert the rest of the s390 code to use flexible-array member instead of a zero-length array - Clean up uaccess inline asm - Enable ARCH_HAS_MEMBARRIER_SYNC_CORE - Convert to using CONFIG_FUNCTION_ALIGNMENT and enable DEBUG_FORCE_FUNCTION_ALIGN_64B - Resolve last_break in userspace fault reports - Simplify one-level sysctl registration - Clean up branch prediction handling - Rework CPU counter facility to retrieve available counter sets just once - Other various small fixes and improvements all over the code * tag 's390-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (118 commits) s390/stackleak: provide fast __stackleak_poison() implementation stackleak: allow to specify arch specific stackleak poison function s390: select ARCH_USE_SYM_ANNOTATIONS s390/mm: use VM_FLUSH_RESET_PERMS in module_alloc() s390: wire up memfd_secret system call s390/mm: enable ARCH_HAS_SET_DIRECT_MAP s390/mm: use BIT macro to generate SET_MEMORY bit masks s390/relocate_kernel: adjust indentation s390/relocate_kernel: use SYM* macros instead of ENTRY(), etc. s390/entry: use SYM* macros instead of ENTRY(), etc. s390/purgatory: use SYM* macros instead of ENTRY(), etc. s390/kprobes: use SYM* macros instead of ENTRY(), etc. s390/reipl: use SYM* macros instead of ENTRY(), etc. s390/head64: use SYM* macros instead of ENTRY(), etc. s390/earlypgm: use SYM* macros instead of ENTRY(), etc. s390/mcount: use SYM* macros instead of ENTRY(), etc. s390/crc32le: use SYM* macros instead of ENTRY(), etc. s390/crc32be: use SYM* macros instead of ENTRY(), etc. s390/crypto,chacha: use SYM* macros instead of ENTRY(), etc. s390/amode31: use SYM* macros instead of ENTRY(), etc. ... |
||
Andrzej Hajda
|
068550631f |
locking/arch: Rename all internal __xchg() names to __arch_xchg()
Decrease the probability of this internal facility to be used by driver code. Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Palmer Dabbelt <palmer@rivosinc.com> [riscv] Link: https://lore.kernel.org/r/20230118154450.73842-1-andrzej.hajda@intel.com Cc: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
f20730efbd |
SMP cross-CPU function-call updates for v6.4:
- Remove diagnostics and adjust config for CSD lock diagnostics - Add a generic IPI-sending tracepoint, as currently there's no easy way to instrument IPI origins: it's arch dependent and for some major architectures it's not even consistently available. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK438RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jJ5Q/5AZ0HGpyqwdFK8GmGznyu5qjP5HwV9pPq gZQScqSy4tZEeza4TFMi83CoXSg9uJ7GlYJqqQMKm78LGEPomnZtXXC7oWvTA9M5 M/jAvzytmvZloSCXV6kK7jzSejMHhag97J/BjTYhZYQpJ9T+hNC87XO6J6COsKr9 lPIYqkFrIkQNr6B0U11AQfFejRYP1ics2fnbnZL86G/zZAc6x8EveM3KgSer2iHl KbrO+xcYyGY8Ef9P2F72HhEGFfM3WslpT1yzqR3sm4Y+fuMG0oW3qOQuMJx0ZhxT AloterY0uo6gJwI0P9k/K4klWgz81Tf/zLb0eBAtY2uJV9Fo3YhPHuZC7jGPGAy3 JusW2yNYqc8erHVEMAKDUsl/1KN4TE2uKlkZy98wno+KOoMufK5MA2e2kPPqXvUi Jk9RvFolnWUsexaPmCftti0OCv3YFiviVAJ/t0pchfmvvJA2da0VC9hzmEXpLJVF 25nBTV/1uAOrWvOpCyo3ElrC2CkQVkFmK5rXMDdvf6ib0Nid4vFcCkCSLVfu+ePB 11mi7QYro+CcnOug1K+yKogUDmsZgV/u1kUwgQzTIpZ05Kkb49gUiXw9L2RGcBJh yoDoiI66KPR7PWQ2qBdQoXug4zfEEtWG0O9HNLB0FFRC3hu7I+HHyiUkBWs9jasK PA5+V7HcQRk= =Wp7f -----END PGP SIGNATURE----- Merge tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP cross-CPU function-call updates from Ingo Molnar: - Remove diagnostics and adjust config for CSD lock diagnostics - Add a generic IPI-sending tracepoint, as currently there's no easy way to instrument IPI origins: it's arch dependent and for some major architectures it's not even consistently available. * tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: trace,smp: Trace all smp_function_call*() invocations trace: Add trace_ipi_send_cpu() sched, smp: Trace smp callback causing an IPI smp: reword smp call IPI comment treewide: Trace IPIs sent via smp_send_reschedule() irq_work: Trace self-IPIs sent via arch_irq_work_raise() smp: Trace IPIs sent via arch_send_call_function_ipi_mask() sched, smp: Trace IPIs sent via send_call_function_single_ipi() trace: Add trace_ipi_send_cpumask() kernel/smp: Make csdlock_debug= resettable locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging locking/csd_lock: Remove added data from CSD lock debugging locking/csd_lock: Add Kconfig option for csd_debug default |
||
Linus Torvalds
|
2aff7c706c |
Objtool changes for v6.4:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically. - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it. - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code. - Generate ORC data for __pfx code - Add more __noreturn annotations to various kernel startup/shutdown/panic functions. - Misc improvements & fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK1x0RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1ghxQ/+IkCynMYtdF5OG9YwbcGJqsPSfOPMEcEM pUSFYg+gGPBDT/fJfcVSqvUtdnWbLC2kXt9yiswXz3X3J2nmNkBk5YKQftsNDcul TmKeqIIAK51XTncpegKH0EGnOX63oZ9Vxa8CTPdDlb+YF23Km2FoudGRI9F5qbUd LoraXqGYeiaeySkGyWmZVl6Uc8dIxnMkTN3H/oI9aB6TOrsi059hAtFcSaFfyemP c4LqXXCH7k2baiQt+qaLZ8cuZVG/+K5r2N2cmjO5kmJc6ynIaFnfMe4XxZLjp5LT /PulYI15bXkvSARKx5CRh/CDHMOx5Blw+ASO0RhWbdy0WH4ZhhcaVF5AeIpPW86a 1LBcz97rMp72WmvKgrJeVO1r9+ll4SI6/YKGJRsxsCMdP3hgFpqntXyVjTFNdTM1 0gH6H5v55x06vJHvhtTk8SR3PfMTEM2fRU5jXEOrGowoGifx+wNUwORiwj6LE3KQ SKUdT19RNzoW3VkFxhgk65ThK1S7YsJUKRoac3YdhttpqqqtFV//erenrZoR4k/p vzvKy68EQ7RCNyD5wNWNFe0YjeJl5G8gQ8bUm4Xmab7djjgz+pn4WpQB8yYKJLAo x9dqQ+6eUbw3Hcgk6qQ9E+r/svbulnAL0AeALAWK/91DwnZ2mCzKroFkLN7napKi fRho4CqzrtM= =NwEV -----END PGP SIGNATURE----- Merge tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code - Generate ORC data for __pfx code - Add more __noreturn annotations to various kernel startup/shutdown and panic functions - Misc improvements & fixes * tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/hyperv: Mark hv_ghcb_terminate() as noreturn scsi: message: fusion: Mark mpt_halt_firmware() __noreturn x86/cpu: Mark {hlt,resume}_play_dead() __noreturn btrfs: Mark btrfs_assertfail() __noreturn objtool: Include weak functions in global_noreturns check cpu: Mark nmi_panic_self_stop() __noreturn cpu: Mark panic_smp_self_stop() __noreturn arm64/cpu: Mark cpu_park_loop() and friends __noreturn x86/head: Mark *_start_kernel() __noreturn init: Mark start_kernel() __noreturn init: Mark [arch_call_]rest_init() __noreturn objtool: Generate ORC data for __pfx code x86/linkage: Fix padding for typed functions objtool: Separate prefix code from stack validation code objtool: Remove superfluous dead_end_function() check objtool: Add symbol iteration helpers objtool: Add WARN_INSN() scripts/objdump-func: Support multiple functions context_tracking: Fix KCSAN noinstr violation objtool: Add stackleak instrumentation to uaccess safe list ... |
||
Linus Torvalds
|
7fa8a8ee94 |
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page(). - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY= =J+Dj -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ... |
||
Linus Torvalds
|
b6a7828502 |
modules-6.4-rc1
The summary of the changes for this pull requests is: * Song Liu's new struct module_memory replacement * Nick Alcock's MODULE_LICENSE() removal for non-modules * My cleanups and enhancements to reduce the areas where we vmalloc module memory for duplicates, and the respective debug code which proves the remaining vmalloc pressure comes from userspace. Most of the changes have been in linux-next for quite some time except the minor fixes I made to check if a module was already loaded prior to allocating the final module memory with vmalloc and the respective debug code it introduces to help clarify the issue. Although the functional change is small it is rather safe as it can only *help* reduce vmalloc space for duplicates and is confirmed to fix a bootup issue with over 400 CPUs with KASAN enabled. I don't expect stable kernels to pick up that fix as the cleanups would have also had to have been picked up. Folks on larger CPU systems with modules will want to just upgrade if vmalloc space has been an issue on bootup. Given the size of this request, here's some more elaborate details on this pull request. The functional change change in this pull request is the very first patch from Song Liu which replaces the struct module_layout with a new struct module memory. The old data structure tried to put together all types of supported module memory types in one data structure, the new one abstracts the differences in memory types in a module to allow each one to provide their own set of details. This paves the way in the future so we can deal with them in a cleaner way. If you look at changes they also provide a nice cleanup of how we handle these different memory areas in a module. This change has been in linux-next since before the merge window opened for v6.3 so to provide more than a full kernel cycle of testing. It's a good thing as quite a bit of fixes have been found for it. Jason Baron then made dynamic debug a first class citizen module user by using module notifier callbacks to allocate / remove module specific dynamic debug information. Nick Alcock has done quite a bit of work cross-tree to remove module license tags from things which cannot possibly be module at my request so to: a) help him with his longer term tooling goals which require a deterministic evaluation if a piece a symbol code could ever be part of a module or not. But quite recently it is has been made clear that tooling is not the only one that would benefit. Disambiguating symbols also helps efforts such as live patching, kprobes and BPF, but for other reasons and R&D on this area is active with no clear solution in sight. b) help us inch closer to the now generally accepted long term goal of automating all the MODULE_LICENSE() tags from SPDX license tags In so far as a) is concerned, although module license tags are a no-op for non-modules, tools which would want create a mapping of possible modules can only rely on the module license tag after the commit |
||
Linus Torvalds
|
556eb8b791 |
Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5 LEGadNS38k5fs+73UaxV =7K4B -----END PGP SIGNATURE----- Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits) device property: make device_property functions take const device * driver core: update comments in device_rename() driver core: Don't require dynamic_debug for initcall_debug probe timing firmware_loader: rework crypto dependencies firmware_loader: Strip off \n from customized path zram: fix up permission for the hot_add sysfs file cacheinfo: Add use_arch[|_cache]_info field/function arch_topology: Remove early cacheinfo error message if -ENOENT cacheinfo: Check cache properties are present in DT cacheinfo: Check sib_leaf in cache_leaves_are_shared() cacheinfo: Allow early level detection when DT/ACPI info is missing/broken cacheinfo: Add arm64 early level initializer implementation cacheinfo: Add arch specific early level initializer tty: make tty_class a static const structure driver core: class: remove struct class_interface * from callbacks driver core: class: mark the struct class in struct class_interface constant driver core: class: make class_register() take a const * driver core: class: mark class_release() as taking a const * driver core: remove incorrect comment for device_create* MIPS: vpe-cmp: remove module owner pointer from struct class usage. ... |
||
Linus Torvalds
|
6e98b09da9 |
Networking changes for 6.4.
Core ---- - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances. - Reduce compound page head access for zero-copy data transfers. - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible. - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance. - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking. - Add lockless accesses annotation to sk_err[_soft]. - Optimize again the skb struct layout. - Extends the skb drop reasons to make it usable by multiple subsystems. - Better const qualifier awareness for socket casts. BPF --- - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses. - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward. - Add more precise memory usage reporting for all BPF map types. - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params. - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton. - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities. - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc. - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps. - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps. - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree. - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them. - Add ARM32 USDT support to libbpf. - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations. Protocols --------- - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address. - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition. - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf. - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures. - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers. - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction. - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore. - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter --------- - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged. - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support. - The iptables 32bit compat interface isn't compiled in by default anymore. - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used. - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device. Driver API ---------- - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time. - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them. - Allow the page_pool to directly recycle the pages from safely localized NAPI. - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization. - Add YNL support for user headers and struct attrs. - Add partial YNL specification for devlink. - Add partial YNL specification for ethtool. - Add tc-mqprio and tc-taprio support for preemptible traffic classes. - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device. - Add basic LED support for switch/phy. - Add NAPI documentation, stop relaying on external links. - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space. - Add transceiver support and improve the error messages for CAN-FD controllers. New hardware / drivers ---------------------- - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers ------- - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors. - add support for configuring max SDU for each Tx queue. - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll. - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates. - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B 9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6 cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7 Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9 aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ vP7ugFnttneg =ITVa -----END PGP SIGNATURE----- Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances - Reduce compound page head access for zero-copy data transfers - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking - Add lockless accesses annotation to sk_err[_soft] - Optimize again the skb struct layout - Extends the skb drop reasons to make it usable by multiple subsystems - Better const qualifier awareness for socket casts BPF: - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward - Add more precise memory usage reporting for all BPF map types - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them - Add ARM32 USDT support to libbpf - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations Protocols: - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter: - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support - The iptables 32bit compat interface isn't compiled in by default anymore - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device Driver API: - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them - Allow the page_pool to directly recycle the pages from safely localized NAPI - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization - Add YNL support for user headers and struct attrs - Add partial YNL specification for devlink - Add partial YNL specification for ethtool - Add tc-mqprio and tc-taprio support for preemptible traffic classes - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device - Add basic LED support for switch/phy - Add NAPI documentation, stop relaying on external links - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space - Add transceiver support and improve the error messages for CAN-FD controllers New hardware / drivers: - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers: - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors - add support for configuring max SDU for each Tx queue - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support" * tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits) net: phy: hide the PHYLIB_LEDS knob net: phy: marvell-88x2222: remove unnecessary (void*) conversions tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. net: amd: Fix link leak when verifying config failed net: phy: marvell: Fix inconsistent indenting in led_blink_set lan966x: Don't use xdp_frame when action is XDP_TX tsnep: Add XDP socket zero-copy TX support tsnep: Add XDP socket zero-copy RX support tsnep: Move skb receive action to separate function tsnep: Add functions for queue enable/disable tsnep: Rework TX/RX queue initialization tsnep: Replace modulo operation with mask net: phy: dp83867: Add led_brightness_set support net: phy: Fix reading LED reg property drivers: nfc: nfcsim: remove return value check of `dev_dir` net: phy: dp83867: Remove unnecessary (void*) conversions net: ethtool: coalesce: try to make user settings stick twice net: mana: Check if netdev/napi_alloc_frag returns single page net: mana: Rename mana_refill_rxoob and remove some empty lines net: veth: add page_pool stats ... |
||
Linus Torvalds
|
9dd6956b38 |
for-6.4/block-2023-04-21
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvcIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpk+JEACj01t7Xen2+Razagu3aTx9tmRGFnTNR3MY raFG6B1TADk1TgCWWa2C4Dj67SOispPLm8hbIcOxqB1UscDWCCwjmnr/debADFzW Ap6shv/IRwVGmDp+F7ocYas0ynwooOJg4WJTwkSKz2o4m4p3vzlwAKi4fLiSjbXp gJTrA7WEvDOVjzajlTFUtjr8rc6PdunbGm25cPIufAxUEhvttYex2VbVqjDmfNsE 8tyyk9RWbe4AY/ZYaGXVn4yQ/CgL/sXFkVc5noRXNfAQ/K3CVLQrFLJ3JlwUHpiA xXBor21TUWCZEo33Y2G5NConAYqE7etoPTkaTDO3/aZ+dAMFyhC/WAYLz1KZGMh1 +g1fDX1QKEd40H2lfDXvqF1ob7Ut8EzUx+gvBXcc3/AiRpJ5rjfOcj6LPUMUqQJk nucLLFTiMKecnDMBERbvixqbaTyrjvkFEj2wYJvgj1LKXAd+x/bj8SGajs9r88Nb 9YT9ai/+Yl7Ppfb67rCgXJU7oNZQSAQ2H+X/l2jbiqImOgq1u/45AmINnbanS7HH Y1I8pbH45AcnCgkJRoQwrNX3BnTOTBJ+D/4Fl4b8jsihq0D3UtwCwPCObHP4LW9S MUNPhP3tUuYsAgXqX80+Sao6SYvXDwnbWOM+LOaaZXgjb1ndwDUZXpto8Ra8WB1u 8kM6s6ZR7g== =W1Zb -----END PGP SIGNATURE----- Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ... |
||
Paolo Bonzini
|
4f382a79a6 |
KVM/arm64 updates for 6.4
- Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. -----BEGIN PGP SIGNATURE----- iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmRCZIwPHG1hekBrZXJu ZWwub3JnAAoJECPQ0LrRPXpDoZ8P/ioXAdDbAE4hTuyD2YdKJ3IGWN3pg52Z7xc2 rBXXFrbK9+n9FEc3AVdHoGsRPDP0Ynl+apj+aB0Klr/Fl0KKqac+W0ARX9rn1mI1 HjeygFPaGnXjMUp0BjeSLS+g3b0gebELJ6R1QEe1/MIPb8Se7M1y3ZpMWdhe0PPL vyzw3LZq2OAlLgWKZhAfhh03qdr2kqJxypYs6nMrcexfn8dXT78dsYKW1nXmqKcE 61Gg23MDPUoexYpUhm+ym5t8hltoI1di8faPmxEpaFzpSDyAg8V5vo6LiW9jn3cf RX0Sikk1laiRAhVbbIFCKC148vFyKxum3scpKyb91Qc+sK1kmIcxvEqlc6SfG9je +5ndZwAfXtW6SMSOyX8y5fXbee7M0sx3n3le9BNgwXfmLWg/GHXJ544dJgVIlf/e 0Z+8QnP1IUDfARR/b2FlW7A7XLzNHQzO379ekcAdUptbGwlS9CrW6SJ83QR7K6fB bh0aSSELKsD7pX8wnNyNACvmz2zL12ITlDKdZWUr8MSxyTjgVy7s0BDsQT3sbrA1 1sH++RvUWfC2k7tVT3vjZFzUDlPw3bnZmo5YMWRTMbXEdr1V5rDw5F5IXit13KeT 8bk0hnJgnLmyoX2A17v5dkFMIKD7p13tqDRdfFcn0ru63HIKxgkS3ITkDmsAQELK DHT7RBE0 =Bhta -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.4 - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. |
||
Paolo Bonzini
|
b3c129e33e |
Minor cleanup:
- phys_to_virt conversion - Improvement of VSIE AP management -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmRBTJoACgkQ41TmuOI4 ufiEkA/8C9Px89sI6iw1/w70mjreJeVXYpXudgDH8YhTs+4udLMJHXjBlYWmJijZ WRW7s9rHInLPXrLOx6fHrlFA+xd4giD/Ub4UHepqUCFdlhRMlykUsqS3XFfcaJ/T l72ik/Xq41fOrGuRrBzPqWckM+5/rpae6tdzuT5MQLaxwifH+IFi+w2jQl3gqnW5 3BFekbflqPPdRgsTKUfP7dhk9MFQ6caGLMvGxJsjQGEziOljrpJ9apK4zrZXhaDU tJdA+Cd6jqsIWY3oYyinXqJYB0iEcoVMW9RpP05MrdrSMseJYSZoMB/mT2OS3xPa byAibx8Iu1U7HUohotWnlLj6SM8xsl5RJ2hi2kPL+7rUNIT6azd2IJ/xcgH+cUwG wDWTRfEH51WWirBsCLd0nvjACClcj0rEwBsRgYbvfqo82TABe7xAKslNSTL4NwVV TT3Lba9jQrAUNsg+UOfNVTNwy6pmkKVc4dqSmOSUC5bxKUKJO735u/442aUjc98C eb+FuxUNU040YEe0dG970zXkyJDJbDtQ+VX88Ko0L22iWJBGNoAXj1dpgd1sQCHY fc+WcHLhP1QlIfRKCff1WHWxETcepHrQizA7pJeHXgxl50VtOzUAkc1Sc4piEfpl ul4XTHZfBwautvqRLak9zTBK3weh3UXvFrX5u8Cl8IBzYUbuGFE= =IjLn -----END PGP SIGNATURE----- Merge tag 'kvm-s390-next-6.4-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD Minor cleanup: - phys_to_virt conversion - Improvement of VSIE AP management |
||
Linus Torvalds
|
df45da57cb |
arm64 updates for 6.4
ACPI: * Improve error reporting when failing to manage SDEI on AGDI device removal Assembly routines: * Improve register constraints so that the compiler can make use of the zero register instead of moving an immediate #0 into a GPR * Allow the compiler to allocate the registers used for CAS instructions CPU features and system registers: * Cleanups to the way in which CPU features are identified from the ID register fields * Extend system register definition generation to handle Enum types when defining shared register fields * Generate definitions for new _EL2 registers and add new fields for ID_AA64PFR1_EL1 * Allow SVE to be disabled separately from SME on the kernel command-line Tracing: * Support for "direct calls" in ftrace, which enables BPF tracing for arm64 Kdump: * Don't bother unmapping the crashkernel from the linear mapping, which then allows us to use huge (block) mappings and reduce TLB pressure when a crashkernel is loaded. Memory management: * Try again to remove data cache invalidation from the coherent DMA allocation path * Simplify the fixmap code by mapping at page granularity * Allow the kfence pool to be allocated early, preventing the rest of the linear mapping from being forced to page granularity Perf and PMU: * Move CPU PMU code out to drivers/perf/ where it can be reused by the 32-bit ARM architecture when running on ARMv8 CPUs * Fix race between CPU PMU probing and pKVM host de-privilege * Add support for Apple M2 CPU PMU * Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event dynamically, depending on what the CPU actually supports * Minor fixes and cleanups to system PMU drivers Stack tracing: * Use the XPACLRI instruction to strip PAC from pointers, rather than rolling our own function in C * Remove redundant PAC removal for toolchains that handle this in their builtins * Make backtracing more resilient in the face of instrumentation Miscellaneous: * Fix single-step with KGDB * Remove harmless warning when 'nokaslr' is passed on the kernel command-line * Minor fixes and cleanups across the board -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmRChcwQHHdpbGxAa2Vy bmVsLm9yZwAKCRC3rHDchMFjNCgBCADFvkYY9ESztSnd3EpiMbbAzgRCQBiA5H7U F2Wc+hIWgeAeUEttSH22+F16r6Jb0gbaDvsuhtN2W/rwQhKNbCU0MaUME05MPmg2 AOp+RZb2vdT5i5S5dC6ZM6G3T6u9O78LBWv2JWBdd6RIybamEn+RL00ep2WAduH7 n1FgTbsKgnbScD2qd4K1ejZ1W/BQMwYulkNpyTsmCIijXM12lkzFlxWnMtky3uhR POpawcIZzXvWI02QAX+SIdynGChQV3VP+dh9GuFbt7ASigDEhgunvfUYhZNSaqf4 +/q0O8toCtmQJBUhF0DEDSB5T8SOz5v9CKxKuwfaX6Trq0ixFQpZ =78L9 -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "ACPI: - Improve error reporting when failing to manage SDEI on AGDI device removal Assembly routines: - Improve register constraints so that the compiler can make use of the zero register instead of moving an immediate #0 into a GPR - Allow the compiler to allocate the registers used for CAS instructions CPU features and system registers: - Cleanups to the way in which CPU features are identified from the ID register fields - Extend system register definition generation to handle Enum types when defining shared register fields - Generate definitions for new _EL2 registers and add new fields for ID_AA64PFR1_EL1 - Allow SVE to be disabled separately from SME on the kernel command-line Tracing: - Support for "direct calls" in ftrace, which enables BPF tracing for arm64 Kdump: - Don't bother unmapping the crashkernel from the linear mapping, which then allows us to use huge (block) mappings and reduce TLB pressure when a crashkernel is loaded. Memory management: - Try again to remove data cache invalidation from the coherent DMA allocation path - Simplify the fixmap code by mapping at page granularity - Allow the kfence pool to be allocated early, preventing the rest of the linear mapping from being forced to page granularity Perf and PMU: - Move CPU PMU code out to drivers/perf/ where it can be reused by the 32-bit ARM architecture when running on ARMv8 CPUs - Fix race between CPU PMU probing and pKVM host de-privilege - Add support for Apple M2 CPU PMU - Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event dynamically, depending on what the CPU actually supports - Minor fixes and cleanups to system PMU drivers Stack tracing: - Use the XPACLRI instruction to strip PAC from pointers, rather than rolling our own function in C - Remove redundant PAC removal for toolchains that handle this in their builtins - Make backtracing more resilient in the face of instrumentation Miscellaneous: - Fix single-step with KGDB - Remove harmless warning when 'nokaslr' is passed on the kernel command-line - Minor fixes and cleanups across the board" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits) KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege arm64: kexec: include reboot.h arm64: delete dead code in this_cpu_set_vectors() arm64/cpufeature: Use helper macro to specify ID register for capabilites drivers/perf: hisi: add NULL check for name drivers/perf: hisi: Remove redundant initialized of pmu->name arm64/cpufeature: Consistently use symbolic constants for min_field_value arm64/cpufeature: Pull out helper for CPUID register definitions arm64/sysreg: Convert HFGITR_EL2 to automatic generation ACPI: AGDI: Improve error reporting for problems during .remove() arm64: kernel: Fix kernel warning when nokaslr is passed to commandline perf/arm-cmn: Fix port detection for CMN-700 arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step arm64: move PAC masks to <asm/pointer_auth.h> arm64: use XPACLRI to strip PAC arm64: avoid redundant PAC stripping in __builtin_return_address() arm64/sme: Fix some comments of ARM SME arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2() arm64/signal: Use system_supports_tpidr2() to check TPIDR2 arm64/idreg: Don't disable SME when disabling SVE ... |
||
Linus Torvalds
|
e7989789c6 |
Timers and timekeeping updates:
- Improve the VDSO build time checks to cover all dynamic relocations VDSO does not allow dynamic relcations, but the build time check is incomplete and fragile. It's based on architectures specifying the relocation types to search for and does not handle R_*_NONE relocation entries correctly. R_*_NONE relocations are injected by some GNU ld variants if they fail to determine the exact .rel[a]/dyn_size to cover trailing zeros. R_*_NONE relocations must be ignored by dynamic loaders, so they should be ignored in the build time check too. Remove the architecture specific relocation types to check for and validate strictly that no other relocations than R_*_NONE end up in the VSDO .so file. - Prefer signal delivery to the current thread for CLOCK_PROCESS_CPUTIME_ID based posix-timers Such timers prefer to deliver the signal to the main thread of a process even if the context in which the timer expires is the current task. This has the downside that it might wake up an idle thread. As there is no requirement or guarantee that the signal has to be delivered to the main thread, avoid this by preferring the current task if it is part of the thread group which shares sighand. This not only avoids waking idle threads, it also distributes the signal delivery in case of multiple timers firing in the context of different threads close to each other better. - Align the tick period properly (again) For a long time the tick was starting at CLOCK_MONOTONIC zero, which allowed users space applications to either align with the tick or to place a periodic computation so that it does not interfere with the tick. The alignement of the tick period was more by chance than by intention as the tick is set up before a high resolution clocksource is installed, i.e. timekeeping is still tick based and the tick period advances from there. The early enablement of sched_clock() broke this alignement as the time accumulated by sched_clock() is taken into account when timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is not longer a multiple of tick periods, which breaks applications which relied on that behaviour. Cure this by aligning the tick starting point to the next multiple of tick periods, i.e 1000ms/CONFIG_HZ. - A set of NOHZ fixes and enhancements - Cure the concurrent writer race for idle and IO sleeptime statistics The statitic values which are exposed via /proc/stat are updated from the CPU local idle exit and remotely by cpufreq, but that happens without any form of serialization. As a consequence sleeptimes can be accounted twice or worse. Prevent this by restricting the accumulation writeback to the CPU local idle exit and let the remote access compute the accumulated value. - Protect idle/iowait sleep time with a sequence count Reading idle/iowait sleep time, e.g. from /proc/stat, can race with idle exit updates. As a consequence the readout may result in random and potentially going backwards values. Protect this by a sequence count, which fixes the idle time statistics issue, but cannot fix the iowait time problem because iowait time accounting races with remote wake ups decrementing the remote runqueues nr_iowait counter. The latter is impossible to fix, so the only way to deal with that is to document it properly and to remove the assertion in the selftest which triggers occasionally due to that. - Restructure struct tick_sched for better cache layout - Some small cleanups and a better cache layout for struct tick_sched - Implement the missing timer_wait_running() callback for POSIX CPU timers For unknown reason the introduction of the timer_wait_running() callback missed to fixup posix CPU timers, which went unnoticed for almost four years. While initially only targeted to prevent livelocks between a timer deletion and the timer expiry function on PREEMPT_RT enabled kernels, it turned out that fixing this for mainline is not as trivial as just implementing a stub similar to the hrtimer/timer callbacks. The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems there is a livelock issue independent of RT. CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers out from hard interrupt context to task work, which is handled before returning to user space or to a VM. The expiry mechanism moves the expired timers to a stack local list head with sighand lock held. Once sighand is dropped the task can be preempted and a task which wants to delete a timer will spin-wait until the expiry task is scheduled back in. In the worst case this will end up in a livelock when the preempting task and the expiry task are pinned on the same CPU. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way. Add a per task mutex to struct posix_cputimers_work, let the expiry task hold it accross the expiry function and let the deleting task which waits for the expiry to complete block on the mutex. In the non-contended case this results in an extra mutex_lock()/unlock() pair on both sides. This avoids spin-waiting on a task which is scheduled out, prevents the livelock and cures the problem for RT and !RT systems. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRGrj4THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoZhdEAC/lwfDWCnTXHC8ExQQRDIVNyXmDlLb EHB8ZY7Wc4gNZ8UEXEOLOXJHMG9bsbtPGctVewJwRGnXZWKVhpPwQba6kCRycyX0 0J6l5DlvUaGGrpoOzOZwgETRmtIZE9tEArZR8xlfRScYd93a7yLhwIjO8JaV9vKs IQpAQMeJ/ysp6gHrS59qakYfoHU/ERUAu3Tk4GqHUtPtcyz3nX3eTlLWV8LySqs+ 00qr2yc0bQFUFoKzTCxtM8lcEi9ja9SOj1rw28348O+BXE4d0HC12Ie7eU/CDN2Y OAlWYxVjy4LMh24LDrRQKTzoVqx9MXDx2g+09B3t8NK5LgeS+EJIjujDhZF147/H 5y906nplZUKa8BiZW5Rpm/HKH8tFI80T9XWSQCRBeMgTEJyRyRU1yASAwO4xw+dY Dn3tGmFGymcV/72o4ic9JFKQd8cTSxPjEJS3qqzMkEAtyI/zPBmKxj/Tce50OH40 6FSZq1uU21ZQzszwSHISwgFtNr75laUSK4Z1te5OhPOOz+C7O9YqHvqS/1jwhPj2 tMd8X17fRW3UTUBlBj+zqxqiEGBl/Yk2AvKrJIXGUtfWYCtjMJ7ieCf0kZ7NSVJx 9ewubA0gqseMD783YomZsy8LLtMKnhclJeslUOVb1oKs1q/WF1R/k6qjy9vUwYaB nIJuHl8mxSetag== =SVnj -----END PGP SIGNATURE----- Merge tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: - Improve the VDSO build time checks to cover all dynamic relocations VDSO does not allow dynamic relocations, but the build time check is incomplete and fragile. It's based on architectures specifying the relocation types to search for and does not handle R_*_NONE relocation entries correctly. R_*_NONE relocations are injected by some GNU ld variants if they fail to determine the exact .rel[a]/dyn_size to cover trailing zeros. R_*_NONE relocations must be ignored by dynamic loaders, so they should be ignored in the build time check too. Remove the architecture specific relocation types to check for and validate strictly that no other relocations than R_*_NONE end up in the VSDO .so file. - Prefer signal delivery to the current thread for CLOCK_PROCESS_CPUTIME_ID based posix-timers Such timers prefer to deliver the signal to the main thread of a process even if the context in which the timer expires is the current task. This has the downside that it might wake up an idle thread. As there is no requirement or guarantee that the signal has to be delivered to the main thread, avoid this by preferring the current task if it is part of the thread group which shares sighand. This not only avoids waking idle threads, it also distributes the signal delivery in case of multiple timers firing in the context of different threads close to each other better. - Align the tick period properly (again) For a long time the tick was starting at CLOCK_MONOTONIC zero, which allowed users space applications to either align with the tick or to place a periodic computation so that it does not interfere with the tick. The alignement of the tick period was more by chance than by intention as the tick is set up before a high resolution clocksource is installed, i.e. timekeeping is still tick based and the tick period advances from there. The early enablement of sched_clock() broke this alignement as the time accumulated by sched_clock() is taken into account when timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is not longer a multiple of tick periods, which breaks applications which relied on that behaviour. Cure this by aligning the tick starting point to the next multiple of tick periods, i.e 1000ms/CONFIG_HZ. - A set of NOHZ fixes and enhancements: * Cure the concurrent writer race for idle and IO sleeptime statistics The statitic values which are exposed via /proc/stat are updated from the CPU local idle exit and remotely by cpufreq, but that happens without any form of serialization. As a consequence sleeptimes can be accounted twice or worse. Prevent this by restricting the accumulation writeback to the CPU local idle exit and let the remote access compute the accumulated value. * Protect idle/iowait sleep time with a sequence count Reading idle/iowait sleep time, e.g. from /proc/stat, can race with idle exit updates. As a consequence the readout may result in random and potentially going backwards values. Protect this by a sequence count, which fixes the idle time statistics issue, but cannot fix the iowait time problem because iowait time accounting races with remote wake ups decrementing the remote runqueues nr_iowait counter. The latter is impossible to fix, so the only way to deal with that is to document it properly and to remove the assertion in the selftest which triggers occasionally due to that. * Restructure struct tick_sched for better cache layout * Some small cleanups and a better cache layout for struct tick_sched - Implement the missing timer_wait_running() callback for POSIX CPU timers For unknown reason the introduction of the timer_wait_running() callback missed to fixup posix CPU timers, which went unnoticed for almost four years. While initially only targeted to prevent livelocks between a timer deletion and the timer expiry function on PREEMPT_RT enabled kernels, it turned out that fixing this for mainline is not as trivial as just implementing a stub similar to the hrtimer/timer callbacks. The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems there is a livelock issue independent of RT. CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers out from hard interrupt context to task work, which is handled before returning to user space or to a VM. The expiry mechanism moves the expired timers to a stack local list head with sighand lock held. Once sighand is dropped the task can be preempted and a task which wants to delete a timer will spin-wait until the expiry task is scheduled back in. In the worst case this will end up in a livelock when the preempting task and the expiry task are pinned on the same CPU. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way. Add a per task mutex to struct posix_cputimers_work, let the expiry task hold it accross the expiry function and let the deleting task which waits for the expiry to complete block on the mutex. In the non-contended case this results in an extra mutex_lock()/unlock() pair on both sides. This avoids spin-waiting on a task which is scheduled out, prevents the livelock and cures the problem for RT and !RT systems * tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Implement the missing timer_wait_running callback selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity selftests/proc: Remove idle time monotonicity assertions MAINTAINERS: Remove stale email address timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick() timers/nohz: Add a comment about broken iowait counter update race timers/nohz: Protect idle/iowait sleep time under seqcount timers/nohz: Only ever update sleeptime from idle exit timers/nohz: Restructure and reshuffle struct tick_sched tick/common: Align tick period with the HZ tick. selftests/timers/posix_timers: Test delivery of signals across threads posix-timers: Prefer delivery of signals to the current thread vdso: Improve cmd_vdso_check to check all dynamic relocations |
||
Linus Torvalds
|
5dfb75e842 |
RCU Changes for 6.4:
o MAINTAINERS files additions and changes. o Fix hotplug warning in nohz code. o Tick dependency changes by Zqiang. o Lazy-RCU shrinker fixes by Zqiang. o rcu-tasks stall reporting improvements by Neeraj. o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep() name for robustness. o Documentation Updates: o Significant changes to srcu_struct size. o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun. o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree. o Other misc changes. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4 5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2 96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+ XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/ +Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe 93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq 4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A= =Rgbj -----END PGP SIGNATURE----- Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ... |
||
Jakub Kicinski
|
9a82cdc28f |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZELn8wAKCRDbK58LschI g1khAQC1nmXPuKjM4EAfFK8Ysb3KoF8ADmpE97n+/HEDydCagwD/bX0+NABR75Nh ueGcoU1TcfcbshDzrH0s+C95owZDZw4= =BeZM -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-04-21 We've added 71 non-merge commits during the last 8 day(s) which contain a total of 116 files changed, 13397 insertions(+), 8896 deletions(-). The main changes are: 1) Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward, from Florian Westphal. 2) Fix race between btf_put and btf_idr walk which caused a deadlock, from Alexei Starovoitov. 3) Second big batch to migrate test_verifier unit tests into test_progs for ease of readability and debugging, from Eduard Zingerman. 4) Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree, from Dave Marchevsky. 5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF selftests into libbpf-provided bpf_helpers.h header and improve kfunc handling, from Andrii Nakryiko. 6) Support 64-bit pointers to kfuncs needed for archs like s390x, from Ilya Leoshkevich. 7) Support BPF progs under getsockopt with a NULL optval, from Stanislav Fomichev. 8) Improve verifier u32 scalar equality checking in order to enable LLVM transformations which earlier had to be disabled specifically for BPF backend, from Yonghong Song. 9) Extend bpftool's struct_ops object loading to support links, from Kui-Feng Lee. 10) Add xsk selftest follow-up fixes for hugepage allocated umem, from Magnus Karlsson. 11) Support BPF redirects from tc BPF to ifb devices, from Daniel Borkmann. 12) Add BPF support for integer type when accessing variable length arrays, from Feng Zhou. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits) selftests/bpf: verifier/value_ptr_arith converted to inline assembly selftests/bpf: verifier/value_illegal_alu converted to inline assembly selftests/bpf: verifier/unpriv converted to inline assembly selftests/bpf: verifier/subreg converted to inline assembly selftests/bpf: verifier/spin_lock converted to inline assembly selftests/bpf: verifier/sock converted to inline assembly selftests/bpf: verifier/search_pruning converted to inline assembly selftests/bpf: verifier/runtime_jit converted to inline assembly selftests/bpf: verifier/regalloc converted to inline assembly selftests/bpf: verifier/ref_tracking converted to inline assembly selftests/bpf: verifier/map_ptr_mixing converted to inline assembly selftests/bpf: verifier/map_in_map converted to inline assembly selftests/bpf: verifier/lwt converted to inline assembly selftests/bpf: verifier/loops1 converted to inline assembly selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly selftests/bpf: verifier/direct_packet_access converted to inline assembly selftests/bpf: verifier/d_path converted to inline assembly selftests/bpf: verifier/ctx converted to inline assembly selftests/bpf: verifier/btf_ctx_access converted to inline assembly selftests/bpf: verifier/bpf_get_stack converted to inline assembly ... ==================== Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Linus Torvalds
|
6b008640db |
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
Instead of having callers care about the mmap_min_addr logic for the
lowest valid mapping address (and some of them getting it wrong), just
move the logic into vm_unmapped_area() itself. One less thing for various
architecture cases (and generic helpers) to worry about.
We should really try to make much more of this be common code, but baby
steps..
Without this, vm_unmapped_area() could return an address below
mmap_min_addr (because some caller forgot about that). That then causes
the mmap machinery to think it has found a workable address, but then
later security_mmap_addr(addr) is unhappy about it and the mmap() returns
with a nonsensical error (EPERM).
The proper action is to either return ENOMEM (if the virtual address space
is exhausted), or try to find another address (ie do a bottom-up search
for free addresses after the top-down one failed).
See commit
|
||
Stefan Roesch
|
d7597f59d1 |
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Nico Boehr
|
8a46df7cd1 |
KVM: s390: pci: fix virtual-physical confusion on module unload/load
When the kvm module is unloaded, zpci_setup_aipb() perists some data in the zpci_aipb structure in s390 pci code. Note that this struct is also passed to firmware in the zpci_set_irq_ctrl() call and thus the GAIT must be a physical address. On module re-insertion, the GAIT is restored from this structure in zpci_reset_aipb(). But it is a physical address, hence this may cause issues when the kvm module is unloaded and loaded again. Fix virtual vs physical address confusion (which currently are the same) by adding the necessary physical-to-virtual-conversion in zpci_reset_aipb(). Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230222155503.43399-1-nrb@linux.ibm.com Message-Id: <20230222155503.43399-1-nrb@linux.ibm.com> |
||
Pierre Morel
|
7be3e33923 |
KVM: s390: vsie: clarifications on setting the APCB
The APCB is part of the CRYCB. The calculation of the APCB origin can be done by adding the APCB offset to the CRYCB origin. Current code makes confusing transformations, converting the CRYCB origin to a pointer to calculate the APCB origin. Let's make things simpler and keep the CRYCB origin to make these calculations. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230214122841.13066-2-pmorel@linux.ibm.com Message-Id: <20230214122841.13066-2-pmorel@linux.ibm.com> |
||
Nico Boehr
|
2f2c0911b9 |
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
We sometimes put a virtual address in next_alert, which should always be a physical address, since it is shared with hardware. This currently works, because virtual and physical addresses are the same. Add phys_to_virt() to resolve the virtual-physical confusion. Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20230223162236.51569-1-nrb@linux.ibm.com Message-Id: <20230223162236.51569-1-nrb@linux.ibm.com> |
||
Heiko Carstens
|
2a405f6bb3 |
s390/stackleak: provide fast __stackleak_poison() implementation
Provide an s390 specific __stackleak_poison() implementation which is
faster than the generic variant.
For the original implementation with an enforced 4kb stackframe for the
getpid() system call the system call overhead increases by a factor of 3 if
the stackleak feature is enabled. Using the s390 mvc based variant this is
reduced to an increase of 25% instead.
This is within the expected area, since the mvc based implementation is
more or less a memset64() variant which comes with similar results. See
commit
|
||
Heiko Carstens
|
ccf7c3fb61 |
s390: select ARCH_USE_SYM_ANNOTATIONS
All old style assembly annotations have been converted for s390. Select ARCH_USE_SYM_ANNOTATIONS to make sure the old macros like ENTRY() aren't available anymore. This prevents that new code which uses the old macros will be added again. This follows what has been done for x86 with commit |
||
Heiko Carstens
|
34e4c79f3b |
s390/mm: use VM_FLUSH_RESET_PERMS in module_alloc()
Make use of the set_direct_map() calls for module allocations. In particular: - All changes to read-only permissions in kernel VA mappings are also applied to the direct mapping. Note that execute permissions are intentionally not applied to the direct mapping in order to make sure that all allocated pages within the direct mapping stay non-executable - module_alloc() passes the VM_FLUSH_RESET_PERMS to __vmalloc_node_range() to make sure that all implicit permission changes made to the direct mapping are reset when the allocated vm area is freed again Side effects: the direct mapping will be fragmented depending on how many vm areas with VM_FLUSH_RESET_PERMS and/or explicit page permission changes are allocated and freed again. For example, just after boot of a system the direct mapping statistics look like: $cat /proc/meminfo ... DirectMap4k: 111628 kB DirectMap1M: 16665600 kB DirectMap2G: 0 kB Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
7608f70adc |
s390: wire up memfd_secret system call
s390 supports ARCH_HAS_SET_DIRECT_MAP, therefore wire up the memfd_secret system call, which depends on it. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
0490d6d7ba |
s390/mm: enable ARCH_HAS_SET_DIRECT_MAP
Implement the set_direct_map_*() API, which allows to invalidate and set default permissions to pages within the direct mapping. Note that kernel_page_present(), which is also supposed to be part of this API, is intentionally not implemented. The reason for this is that kernel_page_present() is only used (and currently only makes sense) for suspend/resume, which isn't supported on s390. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
17c51b1ba9 |
s390/mm: use BIT macro to generate SET_MEMORY bit masks
Use BIT macro to generate SET_MEMORY bit masks, which is easier to maintain if bits get added, or removed. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
0ae241f4d7 |
s390/relocate_kernel: adjust indentation
relocate_kernel.S seems to be the only assembler file which doesn't follow the standard way of indentation. Adjust this for the sake of consistency. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
680957b3b8 |
s390/relocate_kernel: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
fda1dffa44 |
s390/entry: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
04b6d02dbe |
s390/purgatory: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
6cea5f0bc9 |
s390/kprobes: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
26d1429922 |
s390/reipl: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
05d0935d12 |
s390/head64: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
a89d60fc7a |
s390/earlypgm: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
aaaac068f0 |
s390/mcount: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
b5f3c99d15 |
s390/crc32le: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
4b788ac8ed |
s390/crc32be: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
3e5e5107b7 |
s390/crypto,chacha: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Acked-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
ac0c06a1dc |
s390/amode31: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
45769052ae |
s390/lib: use SYM* macros instead of ENTRY(), etc.
Consistently use the SYM* family of macros instead of the deprecated ENTRY(), ENDPROC(), etc. family of macros. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
e48b6853d8 |
s390/kasan: remove override of mem*() functions
The kasan mem*() functions are not used anymore since s390 has switched
to GENERIC_ENTRY and commit
|