mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
synced 2025-01-08 15:04:45 +00:00
89b933a201
9881 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Ryan Roberts
|
3a5a8d343e |
mm: fix race between __split_huge_pmd_locked() and GUP-fast
__split_huge_pmd_locked() can be called for a present THP, devmap or
(non-present) migration entry. It calls pmdp_invalidate() unconditionally
on the pmdp and only determines if it is present or not based on the
returned old pmd. This is a problem for the migration entry case because
pmd_mkinvalid(), called by pmdp_invalidate() must only be called for a
present pmd.
On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future
call to pmd_present() will return true. And therefore any lockless
pgtable walker could see the migration entry pmd in this state and start
interpretting the fields as if it were present, leading to BadThings (TM).
GUP-fast appears to be one such lockless pgtable walker.
x86 does not suffer the above problem, but instead pmd_mkinvalid() will
corrupt the offset field of the swap entry within the swap pte. See link
below for discussion of that problem.
Fix all of this by only calling pmdp_invalidate() for a present pmd. And
for good measure let's add a warning to all implementations of
pmdp_invalidate[_ad](). I've manually reviewed all other
pmdp_invalidate[_ad]() call sites and believe all others to be conformant.
This is a theoretical bug found during code review. I don't have any test
case to trigger it in practice.
Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com
Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
Fixes:
|
||
David Hildenbrand
|
29ae7d96d1 |
mm: pass VMA instead of MM to follow_pte()
... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Alexander Gordeev
|
1812dc9c33 |
Revert "s390: Relocate vmlinux ELF data to virtual address space"
This reverts commit
|
||
Alexander Gordeev
|
22a49f6d30 |
Merge branch 'shared-zeropage' into features
David Hildenbrand says: =================== This series fixes one issue with uffd + shared zeropages on s390x and fixes that "ordinary" KVM guests can make use of shared zeropages again. userfaultfd could currently end up mapping shared zeropages into processes that forbid shared zeropages. This only apples to s390x, relevant for handling PV guests and guests that use storage kets correctly. Fix it by placing a zeroed folio instead of the shared zeropage during UFFDIO_ZEROPAGE instead. I stumbled over this issue while looking into a customer scenario that is using: (1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB available and additional memory can be "fake hotplugged" to the VM later on demand by deflating the balloon. Actual memory overcommit is not desired, so physical memory would only be moved between VMs. (2) Live migration of VMs between sites to evacuate servers in case of emergency. Without the shared zeropage, during (2), the VM would suddenly consume 100 GiB on the migration source and destination. On the migration source, where we don't excpect memory overcommit, we could easilt end up crashing the VM during migration. Independent of that, memory handed back to the hypervisor using "free page reporting" would end up consuming actual memory after the migration on the destination, not getting freed up until reused+freed again. While there might be ways to optimize parts of this in QEMU, we really should just support the shared zeropage again for ordinary VMs. We only expect legcy guests to make use of storage keys, so let's handle zeropages again when enabling storage keys or when enabling PV. To not break userfaultfd like we did in the past, don't zap the shared zeropages, but instead trigger unsharing faults, just like we do for unsharing KSM pages in break_ksm(). Unsharing faults will simply replace the shared zeropage by a zeroed anonymous folio. We can already trigger the same fault path using GUP, when trying to long-term pin a shared zeropage, but also when unmerging a KSM-placed zeropages, so this is nothing new. Patch #1 tested on 86-64 by forcing mm_forbids_zeropage() to be 1, and running the uffd selftests. Patch #2 tested on s390x: the live migration scenario now works as expected, and kvm-unit-tests that trigger usage of skeys work well, whereby I can see detection and unsharing of shared zeropages. Further (as broken in v2), I tested that the shared zeropage is no longer populated after skeys are used -- that mm_forbids_zeropage() works as expected: ./s390x-run s390x/skey.elf \ -no-shutdown \ -chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \ -mon chardev=monitor,mode=readline Then, in another shell: # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss Rss: 31484 kB # echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon ... # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss Rss: 160452 kB -> Reading guest memory does not populate the shared zeropage Doing the same with selftest.elf (no skeys) # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss Rss: 30900 kB # echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon ... # cat /proc/`pgrep qemu`/smaps_rollup | grep Rsstmp/mon Rss: 30924 kB -> Reading guest memory does populate the shared zeropage =================== Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Masahiro Yamada
|
b957df3b85 |
arch: use $(obj)/ instead of $(src)/ for preprocessed linker scripts
These are generated files. Prefix them with $(obj)/ instead of $(src)/. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Helge Deller <deller@gmx.de> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> |
||
Jean-Philippe Brucker
|
175f2f5bcd |
KVM: s390: Check kvm pointer when testing KVM_CAP_S390_HPAGE_1M
KVM allows issuing the KVM_CHECK_EXTENSION ioctl either on the /dev/kvm
fd or the VM fd. In the first case, kvm_vm_ioctl_check_extension() is
called with kvm==NULL. Ensure we don't dereference the pointer in that
case.
Fixes:
|
||
Harald Freudenberger
|
7bbe449d0b |
s390/paes: Reestablish retry loop in paes
With commit |
||
Nina Schoetterl-Glausch
|
cc4edb92f5 |
KVM: s390: vsie: Use virt_to_phys for crypto control block
The address of the crypto control block in the (shadow) SIE block is absolute/physical. Convert from virtual to physical when shadowing the guest's control block during VSIE. Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20240429171512.879215-1-nsg@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
9ecaa2e94e |
s390: Relocate vmlinux ELF data to virtual address space
Currently kernel image relocation tables and other ELF data are set to base zero. Since kernel virtual and physical address spaces are uncoupled the kernel is mapped at the top of the virtual address space, hence making the information contained in vmlinux ELF tables inconsistent. That does not pose any issue with regard to the kernel booting and operation, but makes it difficult to use a generated vmlinux with some debugging tools (e.g. gdb). Relocate vmlinux image base address from zero to a base address in the virtual address space. It is the address that kernel is mapped to in cases KASLR is disabled. The vmlinux ELF header before and after this change looks like this: Elf file type is EXEC (Executable file) Entry point 0x100000 There are 3 program headers, starting at offset 64 Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align LOAD 0x0000000000001000 0x0000000000100000 0x0000000000100000 0x0000000001323378 0x0000000001323378 R E 0x1000 LOAD 0x0000000001325000 0x0000000001424000 0x0000000001424000 0x00000000003a4200 0x000000000048fdb8 RWE 0x1000 NOTE 0x00000000012a33b0 0x00000000013a23b0 0x00000000013a23b0 0x0000000000000054 0x0000000000000054 0x4 Elf file type is EXEC (Executable file) Entry point 0x3ffe0000000 There are 3 program headers, starting at offset 64 Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align LOAD 0x0000000000001000 0x000003ffe0000000 0x000003ffe0000000 0x0000000001323378 0x0000000001323378 R E 0x1000 LOAD 0x0000000001325000 0x000003ffe1324000 0x000003ffe1324000 0x00000000003a4200 0x000000000048fdb8 RWE 0x1000 NOTE 0x00000000012a33b0 0x000003ffe12a23b0 0x000003ffe12a23b0 0x0000000000000054 0x0000000000000054 0x4 Suggested-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Sumanth Korikkar
|
00cda11d3b |
s390: Compile kernel with -fPIC and link with -no-pie
When the kernel is built with CONFIG_PIE_BUILD option enabled it uses dynamic symbols, for which the linker does not allow more than 64K number of entries. This can break features like kpatch. Hence, whenever possible the kernel is built with CONFIG_PIE_BUILD option disabled. For that support of unaligned symbols generated by linker scripts in the compiler is necessary. However, older compilers might lack such support. In that case the build process resorts to CONFIG_PIE_BUILD option-enabled build. Compile object files with -fPIC option and then link the kernel binary with -no-pie linker option. As result, the dynamic symbols are not generated and not only kpatch feature succeeds, but also the whole CONFIG_PIE_BUILD option-enabled code could be dropped. [ agordeev: Reworded the commit message ] Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Sumanth Korikkar
|
5f90003f09 |
s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD
Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
option is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled.
[ agordeev: Reworded the commit message ]
Fixes:
|
||
Sven Schnelle
|
cae74ba8c2 |
s390/ftrace: Use unwinder instead of __builtin_return_address()
Using __builtin_return_address(n) might return undefined values when used with values of n outside of the stack. This was noticed when __builtin_return_address() was called in ftrace on top level functions like the interrupt handlers. As this behaviour cannot be fixed, use the s390 stack unwinder and remove the ftrace compilation flags for unwind_bc.c and stacktrace.c to prevent the unwinding function polluting function traces. Another advantage is that this also works with clang. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Jean Delvare
|
9679fec2ca |
s390/pci: Drop unneeded reference to CONFIG_DMI
The S/390 architecture doesn't support SMBIOS, so CONFIG_DMI will never be defined there. So we can simply omit these preprocessing directives and speed up the build a bit. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Link: https://lore.kernel.org/r/20240423162724.3966265a@endymion.delvare Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Sven Schnelle
|
fe742c08f3 |
s390/os_info: Fix array size in struct os_info
gcc's -Warray-bounds warned about an out-of-bounds access to
the entry array contained in struct os_info. This doesn't trigger
a bug right now because there's a large reserved space after the
array. Nevertheless fix this, and also add a BUILD_BUG_ON to make
sure struct os_info is always exactly on page in size.
Fixes:
|
||
Alexander Egorenkov
|
a2269a66ee |
s390/os_info: Initialize old os_info in standalone dump kernel
The commit be42660d0c13 ("s390/crash: use old os_info to create PT_LOAD headers")
introduced use of the old os_info into standalone dump kernel.
Before this change os_info_old_init() expected to be called only from
a regular kdump kernel although the function itself is able to work
in standalone dump kernels as well (because copy_oldmem_kernel() is able
to handle both use cases). Therefore, fix the expectation of os_info_old_init()
and enable it to be called from a standalone dump kernel.
Fixes:
|
||
Jens Remus
|
b961ec10b9 |
s390/vdso: Add CFI for RA register to asm macro vdso_func
The return-address (RA) register r14 is specified as volatile in the
s390x ELF ABI [1]. Nevertheless proper CFI directives must be provided
for an unwinder to restore the return address, if the RA register
value is changed from its value at function entry, as it is the case.
[1]: s390x ELF ABI, https://github.com/IBM/s390x-abi/releases
Fixes:
|
||
David Hildenbrand
|
25176ad09c |
mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/20240402125516.223131-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
82b7a61839 |
s390: mm: accelerate pagefault when badaccess
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Rick Edgecombe
|
b80fa3cbb7 |
treewide: use initializer for struct vm_unmapped_area_info
Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new ones are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area(), if the convention is to zero initialize the struct and any new field addition missed a call site that initializes each field manually. So it is useful to do things similar across the kernel. The consensus (see links) was that in general the best way to accomplish taking into account both code cleanliness and minimizing the chance of introducing bugs, was to do C99 static initialization. As in: struct vm_unmapped_area_info info = {}; With this method of initialization, the whole struct will be zero initialized, and any statements setting fields to zero will be unneeded. The change should not leave cleanup at the call sides. While iterating though the possible solutions a few archs kindly acked other variations that still zero initialized the struct. These sites have been modified in previous changes using the pattern acked by the respective arch. So to be reduce the chance of bugs via uninitialized fields, perform a tree wide change using the consensus for the best general way to do this change. Use C99 static initializing to zero the struct and remove and statements that simply set members to zero. Link: https://lkml.kernel.org/r/20240326021656.202649-11-rick.p.edgecombe@intel.com Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Link: https://lore.kernel.org/lkml/ec3e377a-c0a0-4dd3-9cb9-96517e54d17e@csgroup.eu/ Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Rick Edgecombe
|
529ce23a76 |
mm: switch mm->get_unmapped_area() to a flag
The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
35a76f5c08 |
mm/arch: provide pud_pfn() fallback
The comment in the code explains the reasons. We took a different approach comparing to pmd_pfn() by providing a fallback function. Another option is to provide some lower level config options (compare to HUGETLB_PAGE or THP) to identify which layer an arch can support for such huge mappings. However that can be an overkill. [peterx@redhat.com: fix loongson defconfig] Link: https://lkml.kernel.org/r/20240403013249.1418299-4-peterx@redhat.com Link: https://lkml.kernel.org/r/20240327152332.950956-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
51718e25c5 |
mm: convert arch_clear_hugepage_flags to take a folio
All implementations that aren't no-ops just set a bit in the flags, and we want to use the folio flags rather than the page flags for that. Rename it to arch_clear_hugetlb_flags() while we're touching it so nobody thinks it's used for THP. [willy@infradead.org: fix arm64 build] Link: https://lkml.kernel.org/r/ZgQvNKGdlDkwhQEX@casper.infradead.org Link: https://lkml.kernel.org/r/20240326171045.410737-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kent Overstreet
|
0069455bcb |
fix missing vmalloc.h includes
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
9636f055da |
mm/treewide: remove pXd_huge()
This API is not used anymore, drop it for the whole tree. Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jakub Kicinski
|
2bd87951de |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c |
||
Jason Gunthorpe
|
e7bc47b166 |
s390: Stop using weak symbols for __iowrite64_copy()
Complete switching the __iowriteXX_copy() routines over to use #define and arch provided inline/macro functions instead of weak symbols. S390 has an implementation that simply calls another memcpy function. Inline this so the callers don't have to do two jumps. Link: https://lore.kernel.org/r/3-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> |
||
Jason Gunthorpe
|
6ae798cbef |
s390: Implement __iowrite32_copy()
It is trivial to implement an inline to do this, so provide it in the s390 headers. Like the 64 bit version it should just invoke zpci_memcpy_toio() with the correct size. Link: https://lore.kernel.org/r/2-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> |
||
Claudio Imbrenda
|
412050af2e |
s390/mm: Fix clearing storage keys for huge pages
The function __storage_key_init_range() expects the end address to be
the first byte outside the range to be initialized. I.e. end - start
should be the size of the area to be initialized.
The current code works because __storage_key_init_range() will still loop
over every page in the range, but it is slower than using sske_frame().
Fixes:
|
||
Claudio Imbrenda
|
843c328068 |
s390/mm: Fix storage key clearing for guest huge pages
The function __storage_key_init_range() expects the end address to be
the first byte outside the range to be initialized. I.e. end - start
should be the size of the area to be initialized.
The current code works because __storage_key_init_range() will still loop
over every page in the range, but it is slower than using sske_frame().
Fixes:
|
||
David Hildenbrand
|
06201e00ee |
s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
commit |
||
Sven Schnelle
|
d111855ab7 |
s390/mm: Fix NULL pointer dereference
The recently added check to figure out if a fault happened on gmap ASCE
dereferences the gmap pointer in lowcore without checking that it is not
NULL. For all non-KVM processes the pointer is NULL, so that some value
from lowcore will be read. With the current layouts of struct gmap and
struct lowcore the read value (aka ASCE) is zero, so that this doesn't lead
to any observable bug; at least currently.
Fix this by adding the missing NULL pointer check.
Fixes:
|
||
Vasily Gorbik
|
ba05b39d54 |
s390/expoline: Make modules use kernel expolines
Currently, kernel modules contain their own set of expoline thunks. In the case of EXPOLINE_EXTERN, this involves postlinking of precompiled expoline.o. expoline.o is also necessary for out-of-source tree module builds. Now that the kernel modules area is less than 4 GB away from kernel expoline thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN the default if the compiler supports it. This simplifies build and aligns with the approach adopted by other architectures. Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Vasily Gorbik
|
ea84f14d2a |
s390/nospec: Correct modules thunk offset calculation
Fix offset calculation when branch target is more then 2Gb away. Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
236d70f82b |
s390/boot: Do not rescue .vmlinux.relocs section
The .vmlinux.relocs section is moved in front of the compressed kernel. The interim section rescue step is avoided as result. Suggested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
56b1069c40 |
s390/boot: Rework deployment of the kernel image
Rework deployment of kernel image for both compressed and uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel configuration variable. In case CONFIG_KERNEL_UNCOMPRESSED is disabled avoid uncompressing the kernel to a temporary buffer and copying it to the target address. Instead, uncompress it directly to the target destination. In case CONFIG_KERNEL_UNCOMPRESSED is enabled avoid moving the kernel to default 0x100000 location when KASLR is disabled or failed. Instead, use the uncompressed kernel image directly. In case KASLR is disabled or failed .amode31 section location in memory is not randomized and precedes the kernel image. In case CONFIG_KERNEL_UNCOMPRESSED is disabled that location overlaps the area used by the decompression algorithm. That is fine, since that area is not used after the decompression finished and the size of .amode31 section is not expected to exceed BOOT_HEAP_SIZE ever. There is no decompression in case CONFIG_KERNEL_UNCOMPRESSED is enabled. Therefore, rename decompress_kernel() to deploy_kernel(), which better describes both uncompressed and compressed cases. Introduce AMODE31_SIZE macro to avoid immediate value of 0x3000 (the size of .amode31 section) in the decompressor linker script. Modify the vmlinux linker script to force the size of .amode31 section to AMODE31_SIZE (the value of (_eamode31 - _samode31) could otherwise differ as result of compiler options used). Introduce __START_KERNEL macro that defines the kernel ELF image entry point and set it to the currrent value of 0x100000. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
54f2ecc318 |
s390: Map kernel at fixed location when KASLR is disabled
Since kernel virtual and physical address spaces are uncoupled the kernel is mapped at the top of the virtual address space in case KASLR is disabled. That does not pose any issue with regard to the kernel booting and operation, but makes it difficult to use a generated vmlinux with some debugging tools (e.g. gdb), because the exact location of the kernel image in virtual memory is unknown. Make that location known and introduce CONFIG_KERNEL_IMAGE_BASE configuration option. A custom CONFIG_KERNEL_IMAGE_BASE value that would break the virtual memory layout leads to a build error. The kernel image size is defined by KERNEL_IMAGE_SIZE macro and set to 512 MB, by analogy with x86. Suggested-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
c98d2ecae0 |
s390/mm: Uncouple physical vs virtual address spaces
The uncoupling physical vs virtual address spaces brings the following benefits to s390: - virtual memory layout flexibility; - closes the address gap between kernel and modules, it caused s390-only problems in the past (e.g. 'perf' bugs); - allows getting rid of trampolines used for module calls into kernel; - allows simplifying BPF trampoline; - minor performance improvement in branch prediction; - kernel randomization entropy is magnitude bigger, as it is derived from the amount of available virtual, not physical memory; The whole change could be described in two pictures below: before and after the change. Some aspects of the virtual memory layout setup are not clarified (number of page levels, alignment, DMA memory), since these are not a part of this change or secondary with regard to how the uncoupling itself is implemented. The focus of the pictures is to explain why __va() and __pa() macros are implemented the way they are. Memory layout in V==R mode: | Physical | Virtual | +- 0 --------------+- 0 --------------+ identity mapping start | | S390_lowcore | Low-address memory | +- 8 KB -----------+ | | | | | identity | phys == virt | | mapping | virt == phys | | | +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start |.amode31 text/data|.amode31 text/data| +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt start | | | | | | +- __kaslr_offset, __kaslr_offset_phys| kernel rand. phys/virt start | | | | kernel text/data | kernel text/data | phys == kvirt | | | +------------------+------------------+ kernel phys/virt end | | | | | | | | | | | | +- ident_map_size -+- ident_map_size -+ identity mapping end | | | ... unused gap | | | +---- vmemmap -----+ 'struct page' array start | | | virtually mapped | | memory map | | | +- __abs_lowcore --+ | | | Absolute Lowcore | | | +- __memcpy_real_area | | | Real Memory Copy| | | +- VMALLOC_START --+ vmalloc area start | | | vmalloc area | | | +- MODULES_VADDR --+ modules area start | | | modules area | | | +------------------+ UltraVisor Secure Storage limit | | | ... unused gap | | | +KASAN_SHADOW_START+ KASAN shadow memory start | | | KASAN shadow | | | +------------------+ ASCE limit Memory layout in V!=R mode: | Physical | Virtual | +- 0 --------------+- 0 --------------+ | | S390_lowcore | Low-address memory | +- 8 KB -----------+ | | | | | | | | ... unused gap | | | | +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start |.amode31 text/data|.amode31 text/data| +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB) | | | | | | +- __kaslr_offset_phys | kernel rand. phys start | | | | kernel text/data | | | | | +------------------+ | kernel phys end | | | | | | | | | | | | +- ident_map_size -+ | | | | ... unused gap | | | +- __identity_base + identity mapping start (>= 2GB) | | | identity | phys == virt - __identity_base | mapping | virt == phys + __identity_base | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | +---- vmemmap -----+ 'struct page' array start | | | virtually mapped | | memory map | | | +- __abs_lowcore --+ | | | Absolute Lowcore | | | +- __memcpy_real_area | | | Real Memory Copy| | | +- VMALLOC_START --+ vmalloc area start | | | vmalloc area | | | +- MODULES_VADDR --+ modules area start | | | modules area | | | +- __kaslr_offset -+ kernel rand. virt start | | | kernel text/data | phys == (kvirt - __kaslr_offset) + | | __kaslr_offset_phys +- kernel .bss end + kernel rand. virt end | | | ... unused gap | | | +------------------+ UltraVisor Secure Storage limit | | | ... unused gap | | | +KASAN_SHADOW_START+ KASAN shadow memory start | | | KASAN shadow | | | +------------------+ ASCE limit Unused gaps in the virtual memory layout could be present or not - depending on how partucular system is configured. No page tables are created for the unused gaps. The relative order of vmalloc, modules and kernel image in virtual memory is defined by following considerations: - start of the modules area and end of the kernel should reside within 4GB to accommodate relative 32-bit jumps. The best way to achieve that is to place kernel next to modules; - vmalloc and module areas should locate next to each other to prevent failures and extra reworks in user level tools (makedumpfile, crash, etc.) which treat vmalloc and module addresses similarily; - kernel needs to be the last area in the virtual memory layout to easily distinguish between kernel and non-kernel virtual addresses. That is needed to (again) simplify handling of addresses in user level tools and make __pa() macro faster (see below); Concluding the above, the relative order of the considered virtual areas in memory is: vmalloc - modules - kernel. Therefore, the only change to the current memory layout is moving kernel to the end of virtual address space. With that approach the implementation of __pa() macro is straightforward - all linear virtual addresses less than kernel base are considered identity mapping: phys == virt - __identity_base All addresses greater than kernel base are kernel ones: phys == (kvirt - __kaslr_offset) + __kaslr_offset_phys By contrast, __va() macro deals only with identity mapping addresses: virt == phys + __identity_base .amode31 section is mapped separately and is not covered by __pa() macro. In fact, it could have been handled easily by checking whether a virtual address is within the section or not, but there is no need for that. Thus, let __pa() code do as little machine cycles as possible. The KASAN shadow memory is located at the very end of the virtual memory layout, at addresses higher than the kernel. However, that is not a linear mapping and no code other than KASAN instrumentation or API is expected to access it. When KASLR mode is enabled the kernel base address randomized within a memory window that spans whole unused virtual address space. The size of that window depends from the amount of physical memory available to the system, the limit imposed by UltraVisor (if present) and the vmalloc area size as provided by vmalloc= kernel command line parameter. In case the virtual memory is exhausted the minimum size of the randomization window is forcefully set to 2GB, which amounts to in 15 bits of entropy if KASAN is enabled or 17 bits of entropy in default configuration. The default kernel offset 0x100000 is used as a magic value both in the decompressor code and vmlinux linker script, but it will be removed with a follow-up change. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
f4cac27dc0 |
s390/crash: Use old os_info to create PT_LOAD headers
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. The vmcore ELF program headers describe virtual memory regions of a crashed kernel. User level tools use that information for the kernel text and data analysis (e.g vmcore-dmesg extracts the kernel log). Currently the kernel image is covered by program headers describing the identity mapping regions. But in the future the kernel image will be mapped into separate region outside of the identity mapping. Create the additional ELF program header that covers kernel image only, so that vmcore tools could locate kernel text and data. Further, the identity mapping in crashed and capture kernels will have different base address. Due to that __va() macro can not be used in the capture kernel. Instead, read crashed kernel identity mapping base address from os_info and use it for PT_LOAD type program headers creation. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
378e32aa81 |
s390/vmcoreinfo: Store virtual memory layout
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. The virtual memory layout is needed for address translation by crash tool when /proc/kcore device is used as the memory image. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
8572f52518 |
s390/os_info: Store virtual memory layout
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. The virtual memory layout will be read out by makedumpfile, crash and other user tools for virtual address translation. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
88702793c5 |
s390/os_info: Introduce value entries
Introduce entries that do not reference any data in memory, but rather provide values. Set the size of such entries to zero and do not compute checksum for them, since there is no data which integrity needs to be checked. The integrity of the value entries itself is still covered by the os_info checksum. Reserve the lowest unused entry index OS_INFO_RESERVED for future use - presumably for the number of entries present. That could later be used by user level tools. The existing tools would not notice any difference. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
5fb50fa66a |
s390/boot: Make .amode31 section address range explicit
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. Introduce .amode31 section address range AMODE31_START and AMODE31_END macros for later use. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
7de0446f0b |
s390/boot: Make identity mapping base address explicit
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. Currently the identity mapping base address is implicit and is always set to zero. Make it explicit by putting into __identity_base persistent boot variable and use it in proper context - which is the value of PAGE_OFFSET. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
3bb11234b1 |
s390/boot: Uncouple virtual and physical kernel offsets
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. Currently __kaslr_offset is the kernel offset in both physical memory on boot and in virtual memory after DAT mode is enabled. Uncouple these offsets and rename the physical address space variant to __kaslr_offset_phys while keep the name __kaslr_offset for the offset in virtual address space. Do not use __kaslr_offset_phys after DAT mode is enabled just yet, but still make it a persistent boot variable for later use. Use __kaslr_offset and __kaslr_offset_phys offsets in proper contexts and alter handle_relocs() function to distinguish between the two. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
236f324b74 |
s390/mm: Create virtual memory layout structure
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. Put virtual memory layout information into a structure to improve code generation when accessing the structure members, which are currently only ident_map_size and __kaslr_offset. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
bbe72f3902 |
s390/mm: Move KASLR related to <asm/page.h>
Move everyting KASLR related to <asm/page.h>, similarly to many other architectures. Acked-by: Heiko Carstens <hca@linux.ibm.com> Suggested-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
c8aef260c8 |
s390/boot: Swap vmalloc and Lowcore/Real Memory Copy areas
This is a preparatory rework to allow uncoupling virtual and physical addresses spaces. Currently the order of virtual memory areas is (the lowcore and .amode31 section are skipped, as it is irrelevant): identity mapping (the kernel is contained within) vmemmap vmalloc modules Absolute Lowcore Real Memory Copy In the future the kernel will be mapped separately and placed to the end of the virtual address space, so the layout would turn like this: identity mapping vmemmap vmalloc modules Absolute Lowcore Real Memory Copy kernel However, the distance between kernel and modules needs to be as little as possible, ideally - none. Thus, the Absolute Lowcore and Real Memory Copy areas would stay in the way and therefore need to be moved as well: identity mapping vmemmap Absolute Lowcore Real Memory Copy vmalloc modules kernel To facilitate such layout swap the vmalloc and Absolute Lowcore together with Real Memory Copy areas. As result, the current layout turns into: identity mapping (the kernel is contained within) vmemmap Absolute Lowcore Real Memory Copy vmalloc modules This will allow to locate the kernel directly next to the modules once it gets mapped separately. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
ecf74da64d |
s390/boot: Reduce size of identity mapping on overlap
In case vmemmap array could overlap with vmalloc area on virtual memory layout setup, the size of vmalloc area is decreased. That could result in less memory than user requested with vmalloc= kernel command line parameter. Instead, reduce the size of identity mapping (and the size of vmemmap array as result) to avoid such overlap. Further, currently the virtual memmory allocation "rolls" from top to bottom and it is only VMALLOC_START that could get increased due to the overlap. Change that to decrease- only, which makes the whole allocation algorithm more easy to comprehend. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
b2b15f079c |
s390/boot: Consider DCSS segments on memory layout setup
The maximum mappable physical address (as returned by arch_get_mappable_range() callback) is limited by the value of (1UL << MAX_PHYSMEM_BITS). The maximum physical address available to a DCSS segment is 512GB. In case the available online or offline memory size is less than the DCSS limit arch_get_mappable_range() would include never used [512GB..(1UL << MAX_PHYSMEM_BITS)] range. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
47bf817672 |
s390/boot: Do not force vmemmap to start at MAX_PHYSMEM_BITS
vmemmap is forcefully set to start at MAX_PHYSMEM_BITS at most. That could be needed in the past to limit ident_map_size to MAX_PHYSMEM_BITS. However since commit 75eba6ec0de1 ("s390: unify identity mapping limits handling") ident_map_size is limited in setup_ident_map_size() function, which is called earlier. Another reason to limit vmemmap start to MAX_PHYSMEM_BITS is because it was returned by arch_get_mappable_range() as the maximum mappable physical address. Since commit f641679dfe55 ("s390/mm: rework arch_get_mappable_range() callback") that is not required anymore. As result, there is no neccessity to limit vmemmap starting address with MAX_PHYSMEM_BITS. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Nina Schoetterl-Glausch
|
22fdd8ba61 |
KVM: s390: vsie: Use virt_to_phys for facility control block
In order for SIE to interpretively execute STFLE, it requires the real or absolute address of a facility-list control block. Before writing the location into the shadow SIE control block, convert it from a virtual address. We currently do not run into this bug because the lower 31 bits are the same for virtual and physical addresses. Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Link: https://lore.kernel.org/r/20240319164420.4053380-3-nsg@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20240319164420.4053380-3-nsg@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Alexander Gordeev
|
036cbbafbd |
s390/irq,nmi: Include <asm/vtime.h> header directly
update_timer_sys() and update_timer_mcck() are inlines used for CPU time accounting from the interrupt and machine-check handlers. These routines are specific to s390 architecture, but included via <linux/vtime.h> header implicitly. Avoid the extra loop and include <asm/vtime.h> header directly. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/3fb696637c0eb7e9d6ffd6cbf9e647d7c5986b3d.1712760275.git.agordeev@linux.ibm.com |
||
Alexander Gordeev
|
60b8edba14 |
s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
__ARCH_HAS_VTIME_TASK_SWITCH macro is not used anymore. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> Link: https://lore.kernel.org/r/b1055852eab0ffea33ad16c92d6a825c83037c3e.1712760275.git.agordeev@linux.ibm.com |
||
Jakub Kicinski
|
94426ed213 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/unix/garbage.c |
||
Matthew Wilcox (Oracle)
|
d35c34bb32 |
s390/mm: Convert gmap_make_secure to use a folio
Remove uses of deprecated page APIs, and move the check for large folios to here to avoid taking the folio lock if the folio is too large. We could do better here by attempting to split the large folio, but I'll leave that improvement for someone who can test it. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240322161149.2327518-3-willy@infradead.org Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Matthew Wilcox (Oracle)
|
259e660d91 |
s390/mm: Convert make_page_secure to use a folio
These page APIs are deprecated, so convert the incoming page to a folio and use the folio APIs instead. The ultravisor API cannot handle large folios, so return -EINVAL if one has slipped through. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240322161149.2327518-2-willy@infradead.org Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Thomas Richter
|
f10933cbd2 |
s390/cpum_cf: make crypto counters upward compatible across machine types
The CPU Measurement facility crypto counter set functionality is defined by the Second Counter Version Number. This number varies between machine types, but is upward compatible. Lessen the checks to reflect this behavior. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Heiko Carstens
|
4f00d4ef66 |
s390: adjust indentation of RELOCS command build step out
Common pattern in non-verbose build output for quiet commands is that the shorthand of a command including whitespace contains at least eight characters. Adjust this for the RELOCS command, which comes only with seven characters. Before: SORTTAB vmlinux CC arch/s390/boot/version.o RELOCS arch/s390/boot/relocs.S OBJCOPY arch/s390/boot/info.bin After: SORTTAB vmlinux CC arch/s390/boot/version.o RELOCS arch/s390/boot/relocs.S OBJCOPY arch/s390/boot/info.bin Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Holger Dengler
|
b3840c8bfc |
s390/ap: rename ap debug configuration option
The configuration option ZCRYPT_DEBUG is used only in ap queue code, so rename it to AP_DEBUG. It also no longer depends on ZCRYPT but on AP. While at it, also update the help text. Signed-off-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Holger Dengler
|
123760841a |
s390/ap: modularize ap bus
There is no hard requirement to have the ap bus statically in the kernel, so add an option to compile it as module. Cc: Tony Krowiak <akrowiak@linux.ibm.com> Cc: Halil Pasic <pasic@linux.ibm.com> Signed-off-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Anthony Krowiak <akrowiak@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Holger Dengler
|
2a483d333f |
s390/chsc: use notifier for AP configuration changes
The direct dependency of chsc and the AP bus prevents the modularization of ap bus. Introduce a notifier interface for AP changes, which decouples the producer of the change events (chsc) from the consumer (ap_bus). Remove the ap_cfg_chg() interface and replace it with the notifier invocation. The ap bus module registers a notification handler, which triggers the AP bus scan. Cc: Vineeth Vijayan <vneethv@linux.ibm.com> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Acked-by: Vineeth Vijayan <vneethv@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Holger Dengler
|
05272aa499 |
s390/uv: export prot_virt_guest symbol in uv
The inline function is_prot_virt_guest() in asm/uv.h makes use of the prot_virt_guest symbol. As this inline function can be called by other parts of the kernel (modules and built-in), the symbol should be exported, similar to the prot_virt_host symbol. One consumer of is_prot_virt_guest() will be the ap bus code. Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Holger Dengler
|
8dec9cb9f5 |
s390/ap: use static qci information
Since qci is available on most of the current machines, move away from the dynamic buffers for qci information and store it instead in a statically defined buffer. The new flags member in struct ap_config_info is now used as an indicator, if qci is available in the system (at least one of these bits is set). Suggested-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
Adrian Hunter
|
c8e3a8b6f2 |
vdso: Consolidate vdso_calc_delta()
Consolidate vdso_calc_delta(), in preparation for further simplification. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-2-adrian.hunter@intel.com |
||
Sumanth Korikkar
|
378ca2d2ad |
s390/entry: align system call table on 8 bytes
Align system call table on 8 bytes. With sys_call_table entry size of 8 bytes that eliminates the possibility of a system call pointer crossing cache line boundary. Cc: stable@kernel.org Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Thomas Richter
|
e9f3af02f6 |
s390/pai: fix sampling event removal for PMU device driver
In case of a sampling event, the PAI PMU device drivers need a
reference to this event. Currently to PMU device driver reference
is removed when a sampling event is destroyed. This may lead to
situations where the reference of the PMU device driver is removed
while being used by a different sampling event.
Reset the event reference pointer of the PMU device driver when
a sampling event is deleted and before the next one might be added.
Fixes:
|
||
Ilya Leoshkevich
|
c9c260681f |
s390/preempt: mark all functions __always_inline
preempt_count-related functions are quite ubiquitous and may be called by noinstr ones, introducing unwanted instrumentation. Here is one example call chain: irqentry_nmi_enter() # noinstr lockdep_hardirqs_enabled() this_cpu_read() __pcpu_size_call_return() this_cpu_read_*() this_cpu_generic_read() __this_cpu_generic_read_nopreempt() preempt_disable_notrace() __preempt_count_inc() __preempt_count_add() They are very small, so there are no significant downsides to force-inlining them. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20240320230007.4782-3-iii@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Ilya Leoshkevich
|
01cac82ae0 |
s390/atomic: mark all functions __always_inline
Atomic functions are quite ubiquitous and may be called by noinstr ones, introducing unwanted instrumentation. They are very small, so there are no significant downsides to force-inlining them. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20240320230007.4782-2-iii@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> |
||
Heiko Carstens
|
e6ec07dc6d |
s390/mm: fix NULL pointer dereference
The recently added check to figure out if a fault happened on gmap ASCE
dereferences the gmap pointer in lowcore without checking that it is not
NULL. For all non-KVM processes the pointer is NULL, so that some value
from lowcore will be read. With the current layouts of struct gmap and
struct lowcore the read value (aka ASCE) is zero, so that this doesn't lead
to any observable bug; at least currently.
Fix this by adding the missing NULL pointer check.
Fixes:
|
||
Eric Biggers
|
29ce50e078 |
crypto: remove CONFIG_CRYPTO_STATS
Remove support for the "Crypto usage statistics" feature (CONFIG_CRYPTO_STATS). This feature does not appear to have ever been used, and it is harmful because it significantly reduces performance and is a large maintenance burden. Covering each of these points in detail: 1. Feature is not being used Since these generic crypto statistics are only readable using netlink, it's fairly straightforward to look for programs that use them. I'm unable to find any evidence that any such programs exist. For example, Debian Code Search returns no hits except the kernel header and kernel code itself and translations of the kernel header: https://codesearch.debian.net/search?q=CRYPTOCFGA_STAT&literal=1&perpkg=1 The patch series that added this feature in 2018 (https://lore.kernel.org/linux-crypto/1537351855-16618-1-git-send-email-clabbe@baylibre.com/) said "The goal is to have an ifconfig for crypto device." This doesn't appear to have happened. It's not clear that there is real demand for crypto statistics. Just because the kernel provides other types of statistics such as I/O and networking statistics and some people find those useful does not mean that crypto statistics are useful too. Further evidence that programs are not using CONFIG_CRYPTO_STATS is that it was able to be disabled in RHEL and Fedora as a bug fix (https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2947). Even further evidence comes from the fact that there are and have been bugs in how the stats work, but they were never reported. For example, before Linux v6.7 hash stats were double-counted in most cases. There has also never been any documentation for this feature, so it might be hard to use even if someone wanted to. 2. CONFIG_CRYPTO_STATS significantly reduces performance Enabling CONFIG_CRYPTO_STATS significantly reduces the performance of the crypto API, even if no program ever retrieves the statistics. This primarily affects systems with a large number of CPUs. For example, https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2039576 reported that Lustre client encryption performance improved from 21.7GB/s to 48.2GB/s by disabling CONFIG_CRYPTO_STATS. It can be argued that this means that CONFIG_CRYPTO_STATS should be optimized with per-cpu counters similar to many of the networking counters. But no one has done this in 5+ years. This is consistent with the fact that the feature appears to be unused, so there seems to be little interest in improving it as opposed to just disabling it. It can be argued that because CONFIG_CRYPTO_STATS is off by default, performance doesn't matter. But Linux distros tend to error on the side of enabling options. The option is enabled in Ubuntu and Arch Linux, and until recently was enabled in RHEL and Fedora (see above). So, even just having the option available is harmful to users. 3. CONFIG_CRYPTO_STATS is a large maintenance burden There are over 1000 lines of code associated with CONFIG_CRYPTO_STATS, spread among 32 files. It significantly complicates much of the implementation of the crypto API. After the initial submission, many fixes and refactorings have consumed effort of multiple people to keep this feature "working". We should be spending this effort elsewhere. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Corentin Labbe <clabbe@baylibre.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> |
||
Jakub Kicinski
|
5e47fbe5ce |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Jakub Kicinski
|
2a702c2e57 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZgHylwAKCRDbK58LschI gzmaAPwKhDFFSU/DU08k22muJxLIXVR7Xx04baJ9mPiFrqZyyAEA8RFNamC7wZIB AnfwwoDjfDTP60rlXFaEf8UT5PpA7Ao= =/KF6 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-03-25 We've added 38 non-merge commits during the last 13 day(s) which contain a total of 50 files changed, 867 insertions(+), 274 deletions(-). The main changes are: 1) Add the ability to specify and retrieve BPF cookie also for raw tracepoint programs in order to ease migration from classic to raw tracepoints, from Andrii Nakryiko. 2) Allow the use of bpf_get_{ns_,}current_pid_tgid() helper for all program types and add additional BPF selftests, from Yonghong Song. 3) Several improvements to bpftool and its build, for example, enabling libbpf logs when loading pid_iter in debug mode, from Quentin Monnet. 4) Check the return code of all BPF-related set_memory_*() functions during load and bail out in case they fail, from Christophe Leroy. 5) Avoid a goto in regs_refine_cond_op() such that the verifier can be better integrated into Agni tool which doesn't support backedges yet, from Harishankar Vishwanathan. 6) Add a small BPF trie perf improvement by always inlining longest_prefix_match, from Jesper Dangaard Brouer. 7) Small BPF selftest refactor in bpf_tcp_ca.c to utilize start_server() helper instead of open-coding it, from Geliang Tang. 8) Improve test_tc_tunnel.sh BPF selftest to prevent client connect before the server bind, from Alessandro Carminati. 9) Fix BPF selftest benchmark for older glibc and use syscall(SYS_gettid) instead of gettid(), from Alan Maguire. 10) Implement a backward-compatible method for struct_ops types with additional fields which are not present in older kernels, from Kui-Feng Lee. 11) Add a small helper to check if an instruction is addr_space_cast from as(0) to as(1) and utilize it in x86-64 JIT, from Puranjay Mohan. 12) Small cleanup to remove unnecessary error check in bpf_struct_ops_map_update_elem, from Martin KaFai Lau. 13) Improvements to libbpf fd validity checks for BPF map/programs, from Mykyta Yatsenko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (38 commits) selftests/bpf: Fix flaky test btf_map_in_map/lookup_update bpf: implement insn_is_cast_user() helper for JITs bpf: Avoid get_kernel_nofault() to fetch kprobe entry IP selftests/bpf: Use start_server in bpf_tcp_ca bpf: Sync uapi bpf.h to tools directory libbpf: Add new sec_def "sk_skb/verdict" selftests/bpf: Mark uprobe trigger functions with nocf_check attribute selftests/bpf: Use syscall(SYS_gettid) instead of gettid() wrapper in bench bpf-next: Avoid goto in regs_refine_cond_op() bpftool: Clean up HOST_CFLAGS, HOST_LDFLAGS for bootstrap bpftool selftests/bpf: scale benchmark counting by using per-CPU counters bpftool: Remove unnecessary source files from bootstrap version bpftool: Enable libbpf logs when loading pid_iter in debug mode selftests/bpf: add raw_tp/tp_btf BPF cookie subtests libbpf: add support for BPF cookie for raw_tp/tp_btf programs bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs bpf: pass whole link instead of prog when triggering raw tracepoint bpf: flatten bpf_probe_register call chain selftests/bpf: Prevent client connect before server bind in test_tc_tunnel.sh selftests/bpf: Add a sk_msg prog bpf_get_ns_current_pid_tgid() test ... ==================== Link: https://lore.kernel.org/r/20240325233940.7154-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Paolo Abeni
|
37ccdf7f11 |
bpf-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZgHmTAAKCRDbK58LschI g1gWAP9HjAWE/Sy0B2t9opIiTqRzdMJLYs2B4OFeHRI6+qQg0gD6A4jsKEh/xmtG Hhjw+AElJRFZ3SUIT4mZlljzUHIYYAA= =T0lM -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-03-25 The following pull-request contains BPF updates for your *net* tree. We've added 17 non-merge commits during the last 12 day(s) which contain a total of 19 files changed, 184 insertions(+), 61 deletions(-). The main changes are: 1) Fix an arm64 BPF JIT bug in BPF_LDX_MEMSX implementation's offset handling found via test_bpf module, from Puranjay Mohan. 2) Various fixups to the BPF arena code in particular in the BPF verifier and around BPF selftests to match latest corresponding LLVM implementation, from Puranjay Mohan and Alexei Starovoitov. 3) Fix xsk to not assume that metadata is always requested in TX completion, from Stanislav Fomichev. 4) Fix riscv BPF JIT's kfunc parameter incompatibility between BPF and the riscv ABI which requires sign-extension on int/uint, from Pu Lehui. 5) Fix s390x BPF JIT's bpf_plt pointer arithmetic which triggered a crash when testing struct_ops, from Ilya Leoshkevich. 6) Fix libbpf's arena mmap handling which had incorrect u64-to-pointer cast on 32-bit architectures, from Andrii Nakryiko. 7) Fix libbpf to define MFD_CLOEXEC when not available, from Arnaldo Carvalho de Melo. 8) Fix arm64 BPF JIT implementation for 32bit unconditional bswap which resulted in an incorrect swap as indicated by test_bpf, from Artem Savkov. 9) Fix BPF man page build script to use silent mode, from Hangbin Liu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: riscv, bpf: Fix kfunc parameters incompatibility between bpf and riscv abi bpf: verifier: reject addr_space_cast insn without arena selftests/bpf: verifier_arena: fix mmap address for arm64 bpf: verifier: fix addr_space_cast from as(1) to as(0) libbpf: Define MFD_CLOEXEC if not available arm64: bpf: fix 32bit unconditional bswap bpf, arm64: fix bug in BPF_LDX_MEMSX libbpf: fix u64-to-pointer cast on 32-bit arches s390/bpf: Fix bpf_plt pointer arithmetic xsk: Don't assume metadata is always requested in TX completion selftests/bpf: Add arena test case for 4Gbyte corner case selftests/bpf: Remove hard coded PAGE_SIZE macro. libbpf, selftests/bpf: Adjust libbpf, bpftool, selftests to match LLVM bpf: Clarify bpf_arena comments. MAINTAINERS: Update email address for Quentin Monnet scripts/bpf_doc: Use silent mode when exec make cmd bpf: Temporarily disable atomic operations in BPF arena ==================== Link: https://lore.kernel.org/r/20240325213520.26688-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
Ilya Leoshkevich
|
7ded842b35 |
s390/bpf: Fix bpf_plt pointer arithmetic
Kui-Feng Lee reported a crash on s390x triggered by the
dummy_st_ops/dummy_init_ptr_arg test [1]:
[<0000000000000002>] 0x2
[<00000000009d5cde>] bpf_struct_ops_test_run+0x156/0x250
[<000000000033145a>] __sys_bpf+0xa1a/0xd00
[<00000000003319dc>] __s390x_sys_bpf+0x44/0x50
[<0000000000c4382c>] __do_syscall+0x244/0x300
[<0000000000c59a40>] system_call+0x70/0x98
This is caused by GCC moving memcpy() after assignments in
bpf_jit_plt(), resulting in NULL pointers being written instead of
the return and the target addresses.
Looking at the GCC internals, the reordering is allowed because the
alias analysis thinks that the memcpy() destination and the assignments'
left-hand-sides are based on different objects: new_plt and
bpf_plt_ret/bpf_plt_target respectively, and therefore they cannot
alias.
This is in turn due to a violation of the C standard:
When two pointers are subtracted, both shall point to elements of the
same array object, or one past the last element of the array object
...
From the C's perspective, bpf_plt_ret and bpf_plt are distinct objects
and cannot be subtracted. In the practical terms, doing so confuses the
GCC's alias analysis.
The code was written this way in order to let the C side know a few
offsets defined in the assembly. While nice, this is by no means
necessary. Fix the noncompliance by hardcoding these offsets.
[1] https://lore.kernel.org/bpf/c9923c1d-971d-4022-8dc8-1364e929d34c@gmail.com/
Fixes:
|
||
Linus Torvalds
|
f9c035492f |
more s390 updates for 6.9 merge window
- Various virtual vs physical address usage fixes - Add new bitwise types and helper functions and use them in s390 specific drivers and code to make it easier to find virtual vs physical address usage bugs. Right now virtual and physical addresses are identical for s390, except for module, vmalloc, and similar areas. This will be changed, hopefully with the next merge window, so that e.g. the kernel image and modules will be located close to each other, allowing for direct branches and also for some other simplifications. As a prerequisite this requires to fix all misuses of virtual and physical addresses. As it turned out people are so used to the concept that virtual and physical addresses are the same, that new bugs got added to code which was already fixed. In order to avoid that even more code gets merged which adds such bugs add and use new bitwise types, so that sparse can be used to find such usage bugs. Most likely the new types can go away again after some time - Provide a simple ARCH_HAS_DEBUG_VIRTUAL implementation - Fix kprobe branch handling: if an out-of-line single stepped relative branch instruction has a target address within a certain address area in the entry code, the program check handler may incorrectly execute cleanup code as if KVM code was executed, leading to crashes - Fix reference counting of zcrypt card objects - Various other small fixes and cleanups -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmX5lIYACgkQIg7DeRsp bsJYxA//cYGSaopFLuHxQndQi5UcMx0NMcaD9auMyDraPWJH0/F+vIWnLPxgJ1wB zfav3Ssvdd/41rPIXHSGxUzXzv+EHCY5+91t4plfCxWtd1SbpJ8gG2vrXTH512ql RhJ7crRFQqoigIxlsdwUkwGSqd+L6H73ikKzQyAaJFKC/e9BEpCH4zb8NuTqCeJE a81/A8BGSIo4Fk4qj1T5bHZzkznmxtisMPGXzoKa28LKhzwqqbGZYeHohnkaT6co UpY/HMdhGH74kkKqpF0cLDI0Bd8ptfjbcVcKyJ8OD7vCfinIM3bZdYg0KIzyRMhu d49JDXctPiePXTCi8X+AJnhNYFGlHuDciEpTMzkw97kwhmCfAOW5UaAlBo8dJqat zt76Cxbl3D+hYI7Wbs9lt2QsXso4/1fMMNQbu9pwyMKxHlCBVe+ZqNIl0gP8OAXZ 51pghcLlcwYNeYRkSzvfEhO9aeUsRg40O5UBCklVMRScrx7qie2wsllEFavb7vZd Ejv89dvn0KtzYIHG+U5SQ9d0x1JL9qApVM06sCGZGsBM9r4hHFGa8x57Pr+51ZPs j0qbr7iAWgCGXN8c3p/Nrev6eqBio81CpD9PWik7QpJS38EKkussqfPdQitgQpsi 7r0Sx51oBzyAVadmVAf8/flUrbJTvD3BdbkzN99W4BXyARJk7CI= =Vh2x -----END PGP SIGNATURE----- Merge tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Heiko Carstens: - Various virtual vs physical address usage fixes - Add new bitwise types and helper functions and use them in s390 specific drivers and code to make it easier to find virtual vs physical address usage bugs. Right now virtual and physical addresses are identical for s390, except for module, vmalloc, and similar areas. This will be changed, hopefully with the next merge window, so that e.g. the kernel image and modules will be located close to each other, allowing for direct branches and also for some other simplifications. As a prerequisite this requires to fix all misuses of virtual and physical addresses. As it turned out people are so used to the concept that virtual and physical addresses are the same, that new bugs got added to code which was already fixed. In order to avoid that even more code gets merged which adds such bugs add and use new bitwise types, so that sparse can be used to find such usage bugs. Most likely the new types can go away again after some time - Provide a simple ARCH_HAS_DEBUG_VIRTUAL implementation - Fix kprobe branch handling: if an out-of-line single stepped relative branch instruction has a target address within a certain address area in the entry code, the program check handler may incorrectly execute cleanup code as if KVM code was executed, leading to crashes - Fix reference counting of zcrypt card objects - Various other small fixes and cleanups * tag 's390-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits) s390/entry: compare gmap asce to determine guest/host fault s390/entry: remove OUTSIDE macro s390/entry: add CIF_SIE flag and remove sie64a() address check s390/cio: use while (i--) pattern to clean up s390/raw3270: make class3270 constant s390/raw3270: improve raw3270_init() readability s390/tape: make tape_class constant s390/vmlogrdr: make vmlogrdr_class constant s390/vmur: make vmur_class constant s390/zcrypt: make zcrypt_class constant s390/mm: provide simple ARCH_HAS_DEBUG_VIRTUAL support s390/vfio_ccw_cp: use new address translation helpers s390/iucv: use new address translation helpers s390/ctcm: use new address translation helpers s390/lcs: use new address translation helpers s390/qeth: use new address translation helpers s390/zfcp: use new address translation helpers s390/tape: fix virtual vs physical address confusion s390/3270: use new address translation helpers s390/3215: use new address translation helpers ... |
||
Sven Schnelle
|
64c3431808 |
s390/entry: compare gmap asce to determine guest/host fault
With the current implementation, there are some cornercases where a host fault would be treated as a guest fault, for example when the sie instruction causes a program check. Therefore store the gmap asce in ptregs, and use that to compare the primary asce from the fault instead of matching instruction addresses. Suggested-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Sven Schnelle
|
29e5bc0f02 |
s390/entry: remove OUTSIDE macro
With only one OUTSIDE user left, remove the macro and move the code directly to the machine check handler. This has the advantage that it is much easier to determine which registers are used. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Sven Schnelle
|
c239c83ed5 |
s390/entry: add CIF_SIE flag and remove sie64a() address check
When a program check, interrupt or machine check is triggered, the PSW address is compared to a certain range of the sie64a() function to figure out whether SIE was interrupted and a cleanup of SIE is needed. This doesn't work with kprobes: If kprobes probes an instruction, it copies the instruction to the kprobes instruction page and overwrites the original instruction with an undefind instruction (Opcode 00). When this instruction is hit later, kprobes single-steps the instruction on the kprobes_instruction page. However, if this instruction is a relative branch instruction it will now point to a different location in memory due to being moved to the kprobes instruction page. If the new branch target points into sie64a() the kernel assumes it interrupted SIE when processing the breakpoint and will crash trying to access the SIE control block. Instead of comparing the address, introduce a new CIF_SIE flag which indicates whether SIE was interrupted. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Suggested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Linus Torvalds
|
4f712ee0cb |
S390:
* Changes to FPU handling came in via the main s390 pull request * Only deliver to the guest the SCLP events that userspace has requested. * More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same). * Fix selftests undefined behavior. x86: * Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec. * Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests). * Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized. * Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. * Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit. * Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code. * Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. * Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot. * Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels. * Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization. * Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives. * Fix the debugregs ABI for 32-bit KVM. * Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD. * Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work. * Cleanup the logic for checking if the currently loaded vCPU is in-kernel. * Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel. x86 Xen emulation: * Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same. * When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. * Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). * Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. RISC-V: * Support exception and interrupt handling in selftests * New self test for RISC-V architectural timer (Sstc extension) * New extension support (Ztso, Zacas) * Support userspace emulation of random number seed CSRs. ARM: * Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers * Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it * Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path * Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register * Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: * Set reserved bits as zero in CPUCFG. * Start SW timer only when vcpu is blocking. * Do not restart SW timer when it is expired. * Remove unnecessary CSR register saving during enter guest. * Misc cleanups and fixes as usual. Generic: * cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else. * Factor common "select" statements in common code instead of requiring each architecture to specify it * Remove thoroughly obsolete APIs from the uapi headers. * Move architecture-dependent stuff to uapi/asm/kvm.h * Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded. * Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker. Selftests: * Reduce boilerplate especially when utilize selftest TAP infrastructure. * Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory. * Fix benign bugs where tests neglect to close() guest_memfd files. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP 6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y 5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA== =mqOV -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ... |
||
Christophe Leroy
|
e60adf5132 |
bpf: Take return from set_memory_rox() into account with bpf_jit_binary_lock_ro()
set_memory_rox() can fail, leaving memory unprotected. Check return and bail out when bpf_jit_binary_lock_ro() returns an error. Link: https://github.com/KSPP/linux/issues/7 Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: linux-hardening@vger.kernel.org <linux-hardening@vger.kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com> # s390x Acked-by: Tiezhu Yang <yangtiezhu@loongson.cn> # LoongArch Reviewed-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> # MIPS Part Message-ID: <036b6393f23a2032ce75a1c92220b2afcb798d5d.1709850515.git.christophe.leroy@csgroup.eu> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Linus Torvalds
|
e5eb28f6d1 |
- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh tKVmvihFxfAhdDthseXcIf1nBjMALwY= =8rVc -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ... |
||
Linus Torvalds
|
902861e34c |
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx TMNhHfyiHYDTx/GAV9NXW84tasJSDgA= =TG55 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ... |
||
Paolo Bonzini
|
17193ced2d |
- Memop selftest rotate fix
- SCLP event bits over indication fix - Missing virt_to_phys for the CRYCB fix -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmXcfYoACgkQ41TmuOI4 ufhPExAAxdcg3WjnTe/EYe+GnyjKo3nZs4y9dhZk9gf06qEYEawhg0ug5akzRZIH SDKeFqOXzl/ZRuL5hvfYBzxpy+IR3rWAYhBKUyxR6aJBl+RZKlf+Xn7l8iIKbNDq vAtLh9Hqza5IJiw/jtorw90TmiHDKvMlvft4UMG3t1IppyktUuuH0aujaVpeKtMR 8qVyGsaTmNHip6Pi7w3WUnvYPkMNLoM7UIPhBAvWrJyYrLxao8pKEGWHaKwbMNHL Om4bjykfFCZ1Cs9aLZDLEasuD61Fpp41DnvImYm77yuDOdI4WalIlV7F5NbjQhhd IrQdsmlZc+N+HKcYvia6MnzAChTpo25pynvW7xXFQIfl/9VxcMFAfSLqLZMGMKFC IwzwI+BA3+bgw6zbN2z2uBShIom7Zzr689U8mbt5q7JborOH38qd5+IX6QFwUtTv IPHrgcULdWHWT5TRaIp61cB9YzCx2YU1QrMWEUVehldQqGEt8ANdZU5Ov0KG1BVl L9ULBIEnJ2ib1pGA7Xlxl2U0Lr2w/dg/p7EAdnOGes50GfEwEjtBzb7VO9Xfrz/Z j927hQO354Y8OYRFjKDjTceENynCiYsbNEhTHE6qFRIwAmeSVk4PT+vIXO6wZlZi Ee3LxsvVUnhYuC7sZbBUNhyiEjNn6GG3LxAtPeDoD+HvhqXI2Qg= =bwck -----END PGP SIGNATURE----- Merge tag 'kvm-s390-next-6.9-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - Memop selftest rotate fix - SCLP event bits over indication fix - Missing virt_to_phys for the CRYCB fix |
||
Heiko Carstens
|
5f58bde726 |
s390/mm: provide simple ARCH_HAS_DEBUG_VIRTUAL support
Provide a very simple ARCH_HAS_DEBUG_VIRTUAL implementation. For now errors are only reported for the following cases: - Trying to translate a vmalloc or module address to a physical address - Translating a supposed to be ZONE_DMA virtual address into a physical address, and the resulting physical address is larger than two GiB Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
e1f51be68d |
s390/cio,idal: fix virtual vs physical address confusion
Fix virtual vs physical address confusion. This does not fix a bug since virtual and physical address spaces are currently the same. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
0c9580cebb |
s390/cio,idal: remove superfluous virt_to_phys() conversion
Only the last 12 bits of virtual / physical addresses are used when masking with IDA_BLOCK_SIZE - 1. Given that the bits are the same regardless of virtual or physical address, remove the virtual to physical address conversion. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
57bc3c2219 |
s390/cio,idal: code cleanup
Adjust coding style, partially refactor code, and use kcalloc() instead of kmalloc() to allocate an idaw array. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Halil Pasic
|
e3e9bda38e |
s390/virtio_ccw: use DMA handle from DMA API
Change and use ccw_device_dma_zalloc() so it returns a virtual address like before, which can be used to access data. However also pass a new dma32_t pointer type handle, which correlates to the returned virtual address. This pointer is used to directly pass/set the DMA handle as returned by the DMA API. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Heiko Carstens
|
1bcf7f48b7 |
s390/cio: use bitwise types to allow for type checking
Change types of I/O structure members which contain physical addresses to dma32_t and dma64_t bitwise types. This allows to make use of sparse (aka "make C=1") to find incorrect usage of physical addresses. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Halil Pasic
|
8b19e145e8 |
s390/cio: introduce bitwise dma types and helper functions
Introduce dma32_t and dma64_t bitwise types, which are supposed to be used for 31 and 64 bit DMA capable addresses. This allows to use sparse (make C=1) for type checking, so that incorrect usages can be easily found. Also add a couple of helper functions which - convert virtual to DMA addresses and vice versa - allow for simple logical and arithmetic operations on DMA addresses - convert DMA addresses to plain u32 and u64 values All helper functions exist to avoid excessive casting in C code. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Co-developed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Steffen Maier <maier@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
Mete Durlu
|
367c50f784 |
s390/vtime: fix average steal time calculation
Current average steal timer calculation produces volatile and inflated
values. The only user of this value is KVM so far and it uses that to
decide whether or not to yield the vCPU which is seeing steal time.
KVM compares average steal timer to a threshold and if the threshold
is past then it does not allow CPU polling and yields it to host, else
it keeps the CPU by polling.
Since KVM's steal time threshold is very low by default (%10) it most
likely is not effected much by the bloated average steal timer values
because the operating region is pretty small. However there might be
new users in the future who might rely on this number. Fix average
steal timer calculation by changing the formula from:
avg_steal_timer = avg_steal_timer / 2 + steal_timer;
to the following:
avg_steal_timer = (avg_steal_timer + steal_timer) / 2;
This ensures that avg_steal_timer is actually a naive average of steal
timer values. It now closely follows steal timer values but of course
in a smoother manner.
Fixes:
|
||
Alexander Gordeev
|
4dc890df63 |
s390/sysinfo: allow response buffer in normal memory
As provided with commit
|
||
Linus Torvalds
|
216532e147 |
hardening updates for v6.9-rc1
- string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko) - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit Mogalapalli) - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael Ellerman) - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn) - Handle tail call optimization better in LKDTM (Douglas Anderson) - Use long form types in overflow.h (Andy Shevchenko) - Add flags param to string_get_size() (Andy Shevchenko) - Add Coccinelle script for potential struct_size() use (Jacob Keller) - Fix objtool corner case under KCFI (Josh Poimboeuf) - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng) - Add str_plural() helper (Michal Wajdeczko, Kees Cook) - Ignore relocations in .notes section - Add comments to explain how __is_constexpr() works - Fix m68k stack alignment expectations in stackinit Kunit test - Convert string selftests to KUnit - Add KUnit tests for fortified string functions - Improve reporting during fortified string warnings - Allow non-type arg to type_max() and type_min() - Allow strscpy() to be called with only 2 arguments - Add binary mode to leaking_addresses scanner - Various small cleanups to leaking_addresses scanner - Adding wrapping_*() arithmetic helper - Annotate initial signed integer wrap-around in refcount_t - Add explicit UBSAN section to MAINTAINERS - Fix UBSAN self-test warnings - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL - Reintroduce UBSAN's signed overflow sanitizer -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmXvm5kWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJiQqD/4mM6SWZpYHKlR1nEiqIyz7Hqr9 g4oguuw6HIVNJXLyeBI5Hd43CTeHPA0e++EETqhUAt7HhErxfYJY+JB221nRYmu+ zhhQ7N/xbTMV/Je7AR03kQjhiMm8LyEcM2X4BNrsAcoCieQzmO3g0zSp8ISzLUE0 PEEmf1lOzMe3gK2KOFCPt5Hiz9sGWyN6at+BQubY18tQGtjEXYAQNXkpD5qhGn4a EF693r/17wmc8hvSsjf4AGaWy1k8crG0WfpMCZsaqftjj0BbvOC60IDyx4eFjpcy tGyAJKETq161AkCdNweIh2Q107fG3tm0fcvw2dv8Wt1eQCko6M8dUGCBinQs/thh TexjJFS/XbSz+IvxLqgU+C5qkOP23E0M9m1dbIbOFxJAya/5n16WOBlGr3ae2Wdq /+t8wVSJw3vZiku5emWdFYP1VsdIHUjVa5QizFaaRhzLGRwhxVV49SP4IQC/5oM5 3MAgNOFTP6yRQn9Y9wP+SZs+SsfaIE7yfKa9zOi4S+Ve+LI2v4YFhh8NCRiLkeWZ R1dhp8Pgtuq76f/v0qUaWcuuVeGfJ37M31KOGIhi1sI/3sr7UMrngL8D1+F8UZMi zcLu+x4GtfUZCHl6znx1rNUBqE5S/5ndVhLpOqfCXKaQ+RAm7lkOJ3jXE2VhNkhp yVEmeSOLnlCaQjZvXQ== =OP+o -----END PGP SIGNATURE----- Merge tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "As is pretty normal for this tree, there are changes all over the place, especially for small fixes, selftest improvements, and improved macro usability. Some header changes ended up landing via this tree as they depended on the string header cleanups. Also, a notable set of changes is the work for the reintroduction of the UBSAN signed integer overflow sanitizer so that we can continue to make improvements on the compiler side to make this sanitizer a more viable future security hardening option. Summary: - string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko) - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit Mogalapalli) - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael Ellerman) - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn) - Handle tail call optimization better in LKDTM (Douglas Anderson) - Use long form types in overflow.h (Andy Shevchenko) - Add flags param to string_get_size() (Andy Shevchenko) - Add Coccinelle script for potential struct_size() use (Jacob Keller) - Fix objtool corner case under KCFI (Josh Poimboeuf) - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng) - Add str_plural() helper (Michal Wajdeczko, Kees Cook) - Ignore relocations in .notes section - Add comments to explain how __is_constexpr() works - Fix m68k stack alignment expectations in stackinit Kunit test - Convert string selftests to KUnit - Add KUnit tests for fortified string functions - Improve reporting during fortified string warnings - Allow non-type arg to type_max() and type_min() - Allow strscpy() to be called with only 2 arguments - Add binary mode to leaking_addresses scanner - Various small cleanups to leaking_addresses scanner - Adding wrapping_*() arithmetic helper - Annotate initial signed integer wrap-around in refcount_t - Add explicit UBSAN section to MAINTAINERS - Fix UBSAN self-test warnings - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL - Reintroduce UBSAN's signed overflow sanitizer" * tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits) selftests/powerpc: Fix load_unaligned_zeropad build failure string: Convert helpers selftest to KUnit string: Convert selftest to KUnit sh: Fix build with CONFIG_UBSAN=y compiler.h: Explain how __is_constexpr() works overflow: Allow non-type arg to type_max() and type_min() VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler() lib/string_helpers: Add flags param to string_get_size() x86, relocs: Ignore relocations in .notes section objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks overflow: Use POD in check_shl_overflow() lib: stackinit: Adjust target string to 8 bytes for m68k sparc: vdso: Disable UBSAN instrumentation kernel.h: Move lib/cmdline.c prototypes to string.h leaking_addresses: Provide mechanism to scan binary files leaking_addresses: Ignore input device status lines leaking_addresses: Use File::Temp for /tmp files MAINTAINERS: Update LEAKING_ADDRESSES details fortify: Improve buffer overflow reporting fortify: Add KUnit tests for runtime overflows ... |
||
Linus Torvalds
|
65d287c7eb |
asm-generic updates for 6.9
Just two small updates this time: - A series I did to unify the definition of PAGE_SIZE through Kconfig, intended to help with a vdso rework that needs the constant but cannot include the normal kernel headers when building the compat VDSO on arm64 and potentially others. - a patch from Yan Zhao to remove the pfn_to_virt() definitions from a couple of architectures after finding they were both incorrect and entirely unused. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmXwEjQACgkQYKtH/8kJ UifwHxAAqXl6R4cZtjUKxHpQoX7TTtBgWyZ9OID8KYt8V/QN+Jme6EhuGV/5CJ1k 5n30PuDvSKPB9865HfCZgh0BDSzSFo2xtc/bDuqiPHO5deNhXUDKX5MowIs3Pf2J EM1OJYiXG/g9vR19uaHvWVA4I1eJk01+Pl5nZ3DA+n9ZYcnM35+HO7EQcH80FGwz jkjN1HizxDmuMDDKn24hrSt6mVoE54JWyeDvklbY4CbwZbtFbtBJiFv3NWTfaxSf MPR1fopgaAkT0aJzUXOh36qDodyqR2tz4M7ucpRKa6/YlOewDN59tFwgwtun0s74 lLJPBqQ6cT8no1VODNnKPb1M5Jh3uzsF1fuhnU6B06Z+1s7sxxqOli1Q0yrpivYY SCAh6WmiCMhHeP/sxfQHRhhrx9l0gOarXh7s4wRJFp+LAi59NuUTeJotoOfboX4M ozeFgW1Rlr+wORzUargRnQiXMLObC/RFdogLgiBJwa8XOI8bOPZg9JfAUPOwbfa2 37IFZRleu+V2NaBF8rS5wRGI8hVp99XSMjlskKLM/645doqNq1cyR9UO68jb1hhF d5X2+BEaEJTHJbXEQ9YtThpNWYzHXL5dFswVJfHDs+CW1FWi5GVqCufZGzr7xihy uNLlVqXLhjM+hU2dDoS4ZshygxN3b8f2qa+GtlIMBYrLcbcjxd4= =X4Cs -----END PGP SIGNATURE----- Merge tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic updates from Arnd Bergmann: "Just two small updates this time: - A series I did to unify the definition of PAGE_SIZE through Kconfig, intended to help with a vdso rework that needs the constant but cannot include the normal kernel headers when building the compat VDSO on arm64 and potentially others - a patch from Yan Zhao to remove the pfn_to_virt() definitions from a couple of architectures after finding they were both incorrect and entirely unused" * tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: arch: define CONFIG_PAGE_SIZE_*KB on all architectures arch: simplify architecture specific page size configuration arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions mm: Remove broken pfn_to_virt() on arch csky/hexagon/openrisc |
||
Linus Torvalds
|
691632f0e8 |
s390 updates for 6.9 merge window
- Various virtual vs physical address usage fixes - Fix error handling in Processor Activity Instrumentation device driver, and export number of counters with a sysfs file - Allow for multiple events when Processor Activity Instrumentation counters are monitored in system wide sampling - Change multiplier and shift values of the Time-of-Day clock source to improve steering precision - Remove a couple of unneeded GFP_DMA flags from allocations - Disable mmap alignment if randomize_va_space is also disabled, to avoid a too small heap - Various changes to allow s390 to be compiled with LLVM=1, since ld.lld and llvm-objcopy will have proper s390 support witch clang 19 - Add __uninitialized macro to Compiler Attributes. This is helpful with s390's FPU code where some users have up to 520 byte stack frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the intention (performance improvement) of such code sections. - Convert switch_to() to an out-of-line function, and use the generic switch_to header file - Replace the usage of s390's debug feature with pr_debug() calls within the zcrypt device driver - Improve hotplug support of the Adjunct Processor device driver - Improve retry handling in the zcrypt device driver - Various changes to the in-kernel FPU code: - Make in-kernel FPU sections preemptible - Convert various larger inline assemblies and assembler files to C, mainly by using singe instruction inline assemblies. This increases readability, but also allows makes it easier to add proper instrumentation hooks - Cleanup of the header files - Provide fast variants of csum_partial() and csum_partial_copy_nocheck() based on vector instructions - Introduce and use a lock to synchronize accesses to zpci device data structures to avoid inconsistent states caused by concurrent accesses - Compile the kernel without -fPIE. This addresses the following problems if the kernel is compiled with -fPIE: - It uses dynamic symbols (.dynsym), for which the linker refuses to allow more than 64k sections. This can break features which use '-ffunction-sections' and '-fdata-sections', including kpatch-build and function granular KASLR - It unnecessarily uses GOT relocations, adding an extra layer of indirection for many memory accesses - Fix shared_cpu_list for CPU private L2 caches, which incorrectly were reported as globally shared -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmXu3jEACgkQIg7DeRsp bsJC8A/9Gi9JSMKWpIDR4WE2MQGwP/PnYdEamtK6c9ewOjIR/UzRIyIM3J1pyV0L RwL8k7EBuv3f7shTcwfPzZWlnAwNwqr1UdcafjFNtHTig50YtdP5fBL33frKHBrm ATedlCjagojOuVbh1gB45WUgzjSSkPyn0vqwjjo4h6uEAQ35zMEWwCs5Hpajlkhi GCdJaiBLJcnhT96QGurQdke+MsrpGCzeBVBnA0qopQEWaQo8OdiAJ1uMD2WKbgPR 817kNzvmE6nXnfd5JevYbaiLjK/HQUSw2dZUS6/fjuIrzTsZEUhSg4ECaprKXDg7 5qiVVPNg4WbJAp0SsB+w7c4U99VxhbS7IVHXju18GrXw6SSAupdxIo7R7YiaT8vC YIXZ1uIQ4Vbts3w/UqWUczIl/ooQt2DdrWT5NDNA+84OlOM42rthzA3vznTWuPTb U21R7cZmN++hAUjR6s4aO2LfS7HQdnKL8nvJW2y99qSfrOXm+M973W2pDhYEVXQh ixQ/lxfQpbBT1yUGlquIErokCPB85VY6ZTdGu6Erziywf4CWGsT5CspyaQnX2KTJ s4CpFPnilrW3OnxmIkrM+pNJDun1nnkGA388Xq1NEKX8Oe65OMXEFNCb0kAHQ1ua vb6534Ib/iuPnxsGpz1sX9iRqtUd06aBovPcbwIvatHCSfkWws8= =KZ31 -----END PGP SIGNATURE----- Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Heiko Carstens: - Various virtual vs physical address usage fixes - Fix error handling in Processor Activity Instrumentation device driver, and export number of counters with a sysfs file - Allow for multiple events when Processor Activity Instrumentation counters are monitored in system wide sampling - Change multiplier and shift values of the Time-of-Day clock source to improve steering precision - Remove a couple of unneeded GFP_DMA flags from allocations - Disable mmap alignment if randomize_va_space is also disabled, to avoid a too small heap - Various changes to allow s390 to be compiled with LLVM=1, since ld.lld and llvm-objcopy will have proper s390 support witch clang 19 - Add __uninitialized macro to Compiler Attributes. This is helpful with s390's FPU code where some users have up to 520 byte stack frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the intention (performance improvement) of such code sections. - Convert switch_to() to an out-of-line function, and use the generic switch_to header file - Replace the usage of s390's debug feature with pr_debug() calls within the zcrypt device driver - Improve hotplug support of the Adjunct Processor device driver - Improve retry handling in the zcrypt device driver - Various changes to the in-kernel FPU code: - Make in-kernel FPU sections preemptible - Convert various larger inline assemblies and assembler files to C, mainly by using singe instruction inline assemblies. This increases readability, but also allows makes it easier to add proper instrumentation hooks - Cleanup of the header files - Provide fast variants of csum_partial() and csum_partial_copy_nocheck() based on vector instructions - Introduce and use a lock to synchronize accesses to zpci device data structures to avoid inconsistent states caused by concurrent accesses - Compile the kernel without -fPIE. This addresses the following problems if the kernel is compiled with -fPIE: - It uses dynamic symbols (.dynsym), for which the linker refuses to allow more than 64k sections. This can break features which use '-ffunction-sections' and '-fdata-sections', including kpatch-build and function granular KASLR - It unnecessarily uses GOT relocations, adding an extra layer of indirection for many memory accesses - Fix shared_cpu_list for CPU private L2 caches, which incorrectly were reported as globally shared * tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits) s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64 s390/cache: prevent rebuild of shared_cpu_list s390/crypto: remove retry loop with sleep from PAES pkey invocation s390/pkey: improve pkey retry behavior s390/zcrypt: improve zcrypt retry behavior s390/zcrypt: introduce retries on in-kernel send CPRB functions s390/ap: introduce mutex to lock the AP bus scan s390/ap: rework ap_scan_bus() to return true on config change s390/ap: clarify AP scan bus related functions and variables s390/ap: rearm APQNs bindings complete completion s390/configs: increase number of LOCKDEP_BITS s390/vfio-ap: handle hardware checkstop state on queue reset operation s390/pai: change sampling event assignment for PMU device driver s390/boot: fix minor comment style damages s390/boot: do not check for zero-termination relocation entry s390/boot: make type of __vmlinux_relocs_64_start|end consistent s390/boot: sanitize kaslr_adjust_relocs() function prototype s390/boot: simplify GOT handling s390: vmlinux.lds.S: fix .got.plt assertion s390/boot: workaround current 'llvm-objdump -t -j ...' behavior ... |
||
Linus Torvalds
|
d08c407f71 |
A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model When timer wheel timers are armed they are placed into the timer wheel of a CPU which is likely to be busy at the time of expiry. This is done to avoid wakeups on potentially idle CPUs. This is wrong in several aspects: 1) The heuristics to select the target CPU are wrong by definition as the chance to get the prediction right is close to zero. 2) Due to #1 it is possible that timers are accumulated on a single target CPU 3) The required computation in the enqueue path is just overhead for dubious value especially under the consideration that the vast majority of timer wheel timers are either canceled or rearmed before they expire. The timer pull model avoids the above by removing the target computation on enqueue and queueing timers always on the CPU on which they get armed. This is achieved by having separate wheels for CPU pinned timers and global timers which do not care about where they expire. As long as a CPU is busy it handles both the pinned and the global timers which are queued on the CPU local timer wheels. When a CPU goes idle it evaluates its own timer wheels: - If the first expiring timer is a pinned timer, then the global timers can be ignored as the CPU will wake up before they expire. - If the first expiring timer is a global timer, then the expiry time is propagated into the timer pull hierarchy and the CPU makes sure to wake up for the first pinned timer. The timer pull hierarchy organizes CPUs in groups of eight at the lowest level and at the next levels groups of eight groups up to the point where no further aggregation of groups is required, i.e. the number of levels is log8(NR_CPUS). The magic number of eight has been established by experimention, but can be adjusted if needed. In each group one busy CPU acts as the migrator. It's only one CPU to avoid lock contention on remote timer wheels. The migrator CPU checks in its own timer wheel handling whether there are other CPUs in the group which have gone idle and have global timers to expire. If there are global timers to expire, the migrator locks the remote CPU timer wheel and handles the expiry. Depending on the group level in the hierarchy this handling can require to walk the hierarchy downwards to the CPU level. Special care is taken when the last CPU goes idle. At this point the CPU is the systemwide migrator at the top of the hierarchy and it therefore cannot delegate to the hierarchy. It needs to arm its own timer device to expire either at the first expiring timer in the hierarchy or at the first CPU local timer, which ever expires first. This completely removes the overhead from the enqueue path, which is e.g. for networking a true hotpath and trades it for a slightly more complex idle path. This has been in development for a couple of years and the final series has been extensively tested by various teams from silicon vendors and ran through extensive CI. There have been slight performance improvements observed on network centric workloads and an Intel team confirmed that this allows them to power down a die completely on a mult-die socket for the first time in a mostly idle scenario. There is only one outstanding ~1.5% regression on a specific overloaded netperf test which is currently investigated, but the rest is either positive or neutral performance wise and positive on the power management side. - Fixes for the timekeeping interpolation code for cross-timestamps: cross-timestamps are used for PTP to get snapshots from hardware timers and interpolated them back to clock MONOTONIC. The changes address a few corner cases in the interpolation code which got the math and logic wrong. - Simplifcation of the clocksource watchdog retry logic to automatically adjust to handle larger systems correctly instead of having more incomprehensible command line parameters. - Treewide consolidation of the VDSO data structures. - The usual small improvements and cleanups all over the place. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuAN0THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoVKXEADIR45rjR1Xtz32js7B53Y65O4WNoOQ 6/ycWcswuGzg/h4QUpPSJ6gOGVmKSWwZi4n0P/VadCiXGSPPm0aUKsoRUt9DZsPY mtj2wjCSXKXiyhTl9OtrZME86ZAIGO1dQXa/sOHsiP5PCjgQkD0b5CYi1+B6eHDt 1/Uo2Tb9g8VAPppq20V5Uo93GrPf642oyi3FCFrR1M112Uuak5DmqHJYiDpreNcG D5SgI+ykSiaUaVyHifvqijoJk0rYXkqEC6evl02477lJ/X0vVo2/M8XPS95BxHST s5Iruo4rP+qeAy8QvhZpoPX59fO0m/AgA7cf77XXAtOpVdLH+bs4ILsEbouAIOtv lsmRkcYt+TpvrZFHPAxks+6g3afuROiDtxD5sXXpVWxvofi8FwWqubdlqdsbw9MP ZCTNyzNyKL47QeDwBfSynYUL1RSyqsphtIwk4oeQklH9rwMAnW21hi30z15hQ0pQ FOVkmcwi79JNvl/G+jRkDzw7r8/zcHshWdSjyUM04CDjjnCDjQOFWSIjEPwbQjjz S4HXpJKJW963dBgs9Z84/Ctw1GwoBk1qedDWDJE1257Qvmo/Wpe/7GddWcazOGnN RRFMzGPbOqBDbjtErOKGU+iCisgNEvz2XK+TI16uRjWde7DxZpiTVYgNDrZ+/Pyh rQ23UBms6ZRR+A== =iQlu -----END PGP SIGNATURE----- Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A large set of updates and features for timers and timekeeping: - The hierarchical timer pull model When timer wheel timers are armed they are placed into the timer wheel of a CPU which is likely to be busy at the time of expiry. This is done to avoid wakeups on potentially idle CPUs. This is wrong in several aspects: 1) The heuristics to select the target CPU are wrong by definition as the chance to get the prediction right is close to zero. 2) Due to #1 it is possible that timers are accumulated on a single target CPU 3) The required computation in the enqueue path is just overhead for dubious value especially under the consideration that the vast majority of timer wheel timers are either canceled or rearmed before they expire. The timer pull model avoids the above by removing the target computation on enqueue and queueing timers always on the CPU on which they get armed. This is achieved by having separate wheels for CPU pinned timers and global timers which do not care about where they expire. As long as a CPU is busy it handles both the pinned and the global timers which are queued on the CPU local timer wheels. When a CPU goes idle it evaluates its own timer wheels: - If the first expiring timer is a pinned timer, then the global timers can be ignored as the CPU will wake up before they expire. - If the first expiring timer is a global timer, then the expiry time is propagated into the timer pull hierarchy and the CPU makes sure to wake up for the first pinned timer. The timer pull hierarchy organizes CPUs in groups of eight at the lowest level and at the next levels groups of eight groups up to the point where no further aggregation of groups is required, i.e. the number of levels is log8(NR_CPUS). The magic number of eight has been established by experimention, but can be adjusted if needed. In each group one busy CPU acts as the migrator. It's only one CPU to avoid lock contention on remote timer wheels. The migrator CPU checks in its own timer wheel handling whether there are other CPUs in the group which have gone idle and have global timers to expire. If there are global timers to expire, the migrator locks the remote CPU timer wheel and handles the expiry. Depending on the group level in the hierarchy this handling can require to walk the hierarchy downwards to the CPU level. Special care is taken when the last CPU goes idle. At this point the CPU is the systemwide migrator at the top of the hierarchy and it therefore cannot delegate to the hierarchy. It needs to arm its own timer device to expire either at the first expiring timer in the hierarchy or at the first CPU local timer, which ever expires first. This completely removes the overhead from the enqueue path, which is e.g. for networking a true hotpath and trades it for a slightly more complex idle path. This has been in development for a couple of years and the final series has been extensively tested by various teams from silicon vendors and ran through extensive CI. There have been slight performance improvements observed on network centric workloads and an Intel team confirmed that this allows them to power down a die completely on a mult-die socket for the first time in a mostly idle scenario. There is only one outstanding ~1.5% regression on a specific overloaded netperf test which is currently investigated, but the rest is either positive or neutral performance wise and positive on the power management side. - Fixes for the timekeeping interpolation code for cross-timestamps: cross-timestamps are used for PTP to get snapshots from hardware timers and interpolated them back to clock MONOTONIC. The changes address a few corner cases in the interpolation code which got the math and logic wrong. - Simplifcation of the clocksource watchdog retry logic to automatically adjust to handle larger systems correctly instead of having more incomprehensible command line parameters. - Treewide consolidation of the VDSO data structures. - The usual small improvements and cleanups all over the place" * tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) timer/migration: Fix quick check reporting late expiry tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n vdso/datapage: Quick fix - use asm/page-def.h for ARM64 timers: Assert no next dyntick timer look-up while CPU is offline tick: Assume timekeeping is correctly handed over upon last offline idle call tick: Shut down low-res tick from dying CPU tick: Split nohz and highres features from nohz_mode tick: Move individual bit features to debuggable mask accesses tick: Move got_idle_tick away from common flags tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING tick: Move tick cancellation up to CPUHP_AP_TICK_DYING tick: Start centralizing tick related CPU hotplug operations tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick() tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick() tick: Use IS_ENABLED() whenever possible tick/sched: Remove useless oneshot ifdeffery tick/nohz: Remove duplicate between lowres and highres handlers tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer() hrtimer: Select housekeeping CPU during migration ... |
||
Linus Torvalds
|
d451b075f7 |
linux_kselftest-next-6.9-rc1
This kselftest next update for Linux 6.9-rc1 consists of: -- livepatch restructuring to move the module out of lib to be built as a out-of-tree modules during kselftest build. This change makes it easier change, debug and rebuild the tests by running make on the selftests/livepatch directory, which is not currently possible since the modules on lib/livepatch are build and installed using the main makefile modules target. -- livepatch restructuring fixes for problems found by kernel test robot. The change skips the test if kernel-devel isn't installed (default value of KDIR), or if KDIR variable passed doesn't exists. -- resctrl test restructuring and new non-contiguous CBMs CAT test -- new ktap_helpers to print diagnostic messages, pass/fail tests based on exit code, abort test, and finish the test. -- a new test verify power supply properties. -- a new ftrace to exercise function tracer across cpu hotplug. -- timeout increase for mqueue test to allow the test to run on i3.metal AWS instances. -- minor spelling corrections in several tests. -- missing gitignore files and changes to existing gitignore files. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmXo/kUACgkQCwJExA0N Qxy0aBAAk0SLA1ZIAdlNjo5B13C7GC7rFRrtaai9ReXSvU/X5TX9sD5T9DIULKdj Mcqi+oaP88GPSUZS+bn7DyVxKyuvHg/f4jWQwqZ34WxK4K1K+yt+3YhTnHZx7ezU 6WIbUsD1Zs7tXXI2v76riHFbD3pfxZ+AXQaf/1cXDi4SpIpLkiqyeYWoWN5Z2rtJ BwMzrI2RBiLMox4g8F3Ey4BX+bOIYiiJq5bdl7gJVKcp74VdU3S7IyOuXFbSdcFR xxmFMxWGFOgRzexW0fmDWLudD2dII0XQAExSsl5xMnR/lmSh+lHWheoNgphQl050 VcLmrPugWVJSioe0fHEgmDQXe3lPqDtepUg921tIlWvCmtR3Ur6+GpILTbSvQ4qp SK+2pt7nGSAT2UkRO/6/TYFG3mELADvj6tglj0b1SkIXmNiF+7OZ+hJ2XqyM7peo Z7gtmSmpbAotxp64Jj8HsNZLpCX0xdaxoTMEWPoG09fwTXY7Hy03yoWDKBKB4MZ9 jBtNXDolhpEQ/ppSGFnRPzXuNVapYX28UY0cwBBVgke5jwB8SUnBEr2dbNnVU1q0 y5uxtj/EFQzxSynB3eM1us2OuXvr5TfAWmKVpyE/cNC3WreHeA+Y2kN1dzv8hgpw o4NbltdF8F+a9qQF9B1XvjVhqa5By1esS1jOg96cJgGseAVWiQs= =G+DO -----END PGP SIGNATURE----- Merge tag 'linux_kselftest-next-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest update from Shuah Khan: - livepatch restructuring to move the module out of lib to be built as a out-of-tree modules during kselftest build. This makes it easier change, debug and rebuild the tests by running make on the selftests/livepatch directory, which is not currently possible since the modules on lib/livepatch are build and installed using the main makefile modules target. - livepatch restructuring fixes for problems found by kernel test robot. The change skips the test if kernel-devel isn't installed (default value of KDIR), or if KDIR variable passed doesn't exists. - resctrl test restructuring and new non-contiguous CBMs CAT test - new ktap_helpers to print diagnostic messages, pass/fail tests based on exit code, abort test, and finish the test. - a new test verify power supply properties. - a new ftrace to exercise function tracer across cpu hotplug. - timeout increase for mqueue test to allow the test to run on i3.metal AWS instances. - minor spelling corrections in several tests. - missing gitignore files and changes to existing gitignore files. * tag 'linux_kselftest-next-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (57 commits) kselftest: Add basic test for probing the rust sample modules selftests: lib.mk: Do not process TEST_GEN_MODS_DIR selftests: livepatch: Avoid running the tests if kernel-devel is missing selftests: livepatch: Add initial .gitignore selftests/resctrl: Add non-contiguous CBMs CAT test selftests/resctrl: Add resource_info_file_exists() selftests/resctrl: Split validate_resctrl_feature_request() selftests/resctrl: Add a helper for the non-contiguous test selftests/resctrl: Add test groups and name L3 CAT test L3_CAT selftests: sched: Fix spelling mistake "hiearchy" -> "hierarchy" selftests/mqueue: Set timeout to 180 seconds selftests/ftrace: Add test to exercize function tracer across cpu hotplug selftest: ftrace: fix minor typo in log selftests: thermal: intel: workload_hint: add missing gitignore selftests: thermal: intel: power_floor: add missing gitignore selftests: uevent: add missing gitignore selftests: Add test to verify power supply properties selftests: ktap_helpers: Add a helper to finish the test selftests: ktap_helpers: Add a helper to abort the test selftests: ktap_helpers: Add helper to pass/fail test based on exit code ... |
||
Paolo Bonzini
|
e9a2bba476 |
KVM Xen and pfncache changes for 6.9:
- Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrblYACgkQOlYIJqCj N/3K4Q/+KZ8lrnNXvdHNCQdosA5DDXpqUcRzhlTUp82fncpdJ0LqrSMzMots2Eh9 KC0jSPo8EkivF+Epug0+bpQBEaLXzTWhRcS1grePCDz2lBnxoHFSWjvaK2p14KlC LvxCJZjxyfLKHwKHpSndvO9hVFElCY3mvvE9KRcKeQAmrz1cz+DDMKelo1MuV8D+ GfymhYc+UXpY41+6hQdznx+WoGoXKRameo3iGYuBoJjvKOyl4Wxkx9WSXIxxxuqG kHxjiWTR/jF1ITJl6PeMrFcGl3cuGKM/UfTOM6W2h6Wi3mhLpXveoVLnqR1kipIj btSzSVHL7C4WTPwOcyhwPzap+dJmm31c6N0uPScT7r9yhs+q5BDj26vcVcyPZUHo efIwmsnO2eQvuw+f8C6QqWCPaxvw46N0zxzwgc5uA3jvAC93y0l4v+xlAQsC0wzV 0+BwU00cutH/3t3c/WPD5QcmRLH726VoFuTlaDufpoMU7gBVJ8rzjcusxR+5BKT+ GJcAgZxZhEgvnzmTKd4Ec/mt+xZ2Erd+kV3MKCHvDPyj8jqy8FQ4DAWKGBR+h3WR rqAs2k8NPHyh3i1a3FL1opmxEGsRS+Cnc6Bi77cj9DxTr22JkgDJEuFR+Ues1z6/ SpE889kt3w5zTo34+lNxNPlIKmO0ICwwhDL6pxJTWU7iWQnKypU= =GliW -----END PGP SIGNATURE----- Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock. |
||
Paolo Bonzini
|
233d0bc4d8 |
LoongArch KVM changes for v6.9
1. Set reserved bits as zero in CPUCFG. 2. Start SW timer only when vcpu is blocking. 3. Do not restart SW timer when it is expired. 4. Remove unnecessary CSR register saving during enter guest. -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmXoeWIWHGNoZW5odWFj YWlAa2VybmVsLm9yZwAKCRAChivD8uImehb3D/9C5IrdyU/2f3fEUuuXO0a2ZS1p l2OT+yr7C6/jATokGcd+53CF8MzYawzuAT3tSXYyoqAxRu0HUkvuS1oA/eFM4EwV iIoUC3jnqcsQ5LCPt6yt+Tzgug64Xm5F4btYWIpmXgCJWx/VVG6+z3JarXAfA2it vgVMGgrrfHt68sEsenNFNgiJ5tCCubjR7XFwjM8rsL7AzUDdmXpF7gFyH2Ufgosi a5CxcPPauO1y5ZCGU4JU9QvxnVqW1kt/TRZIGqqGfULtlBSoZbD9zP3OcCQkL+ai SPNxvU5I+BeX6honpmO6aR/F1EphQhRji3ZKxI8UBo4aJD5+FtMG/YOEPI+ZAS0/ JPuWpDqJH46SN3jfKTQay8jXc+mcnOYXJ9Yrixd4UCf66WJit/+BOma/wP638u2j RUzm1kqhNGad6QiDDtSjISM6sg6FozAGc/KhCkWAhV+lHLnfkXtaf3S+GIu5OiWz ETCKlmIGiy0y774+iftlD7RDRGmtrC4cx5ibl7cKKi62Y5vgujCdDofAyYC+D5cW puaIuHOx1hWtPRT9p1WfUL310ED+Qj3N2pDDcJcqdCIiRRZ5l/hxGS7V687a30WV GcegEqh19CjI9KDat4E1ld4jUHJxaFrw3pr2z3SP7cW3IgdngPJL57M0M2jSazaQ 479xZPJ/i4xhJaKACg== =8HOW -----END PGP SIGNATURE----- Merge tag 'loongarch-kvm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.9 * Set reserved bits as zero in CPUCFG. * Start SW timer only when vcpu is blocking. * Do not restart SW timer when it is expired. * Remove unnecessary CSR register saving during enter guest. |
||
Paolo Bonzini
|
7d8942d8e7 |
KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXZB/8ACgkQOlYIJqCj N/3XlQ//RIsvqr38k7kELSKhCMyWgF4J57itABrHpMqAZu3gaAo5sETX8AGcHEe5 mxmquxyNQSf4cthhWy1kzxjGCy6+fk+Z0Z7wzfz0Yd5D+FI6vpo3HhkjovLb2gpt kSrHuhJyuj2vkftNvdaz0nHX1QalVyIEnXnR3oqTmxUUsg6lp1x/zr5SP0KBXjo8 ZzJtyFd0fkRXWpA792T7XPRBWrzPV31HYZBLX8sPlYmJATcbIx9rYSThgCN6XuVN bfE6wATsC+mwv5BpCoDFpCKmFcqSqamag9NGe5qE5mOby5DQGYTCRMCQB8YXXBR0 97ppaY9ZJV4nOVjrYJn6IMOSMVNfoG7nTRFfcd0eFP4tlPEgHwGr5BGDaBtQPkrd KcgWJw8nS02eCA2iOE+FtCXvGJwKhTTjQ45w7rU4EcfUk603L5J4GO1ddmjMhPcP upGGcWDK9vCGrSUFTm8pyWp/NKRJPvAQEiQd/BweSk9+isQHTX2RYCQgPAQnwlTS wTg7ZPNSLoUkRYmd6r+TUT32ELJGNc8GLftMnxIwweq6V7AgNMi0HE60eMovuBNO 7DAWWzfBEZmJv+0mNNZPGXczHVv4YvMWysRdKkhztBc3+sO7P3AL1zWIDlm5qwoG LpFeeI3qo3o5ZNaqGzkSop2pUUGNGpWCH46WmP0AG7RpzW/Natw= =M0td -----END PGP SIGNATURE----- Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged. |