Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"The largest part here is for KVM/PPC, where a NULL pointer dereference
was introduced in the 6.13 merge window and is now fixed.
There's some "holiday-induced lateness", as the s390 submaintainer put
it, but otherwise things look fine.
s390:
- fix a latent bug when the kernel is compiled in debug mode
- two small UCONTROL fixes and their selftests
arm64:
- always check page state in hyp_ack_unshare()
- align set_id_regs selftest with the fact that ASIDBITS field is RO
- various vPMU fixes for bugs that only affect nested virt
PPC e500:
- Fix a mostly impossible (but just wrong) case where IRQs were never
re-enabled
- Observe host permissions instead of mapping readonly host pages as
guest-writable. This fixes a NULL-pointer dereference in 6.13
- Replace brittle VMA-based attempts at building huge shadow TLB
entries with PTE lookups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: e500: perform hugepage check after looking up the PFN
KVM: e500: map readonly host pages for read
KVM: e500: track host-writability of pages
KVM: e500: use shadow TLB entry as witness for writability
KVM: e500: always restore irqs
KVM: s390: selftests: Add has device attr check to uc_attr_mem_limit selftest
KVM: s390: selftests: Add ucontrol gis routing test
KVM: s390: Reject KVM_SET_GSI_ROUTING on ucontrol VMs
KVM: s390: selftests: Add ucontrol flic attr selftests
KVM: s390: Reject setting flic pfault attributes on ucontrol VMs
KVM: s390: vsie: fix virtual/physical address in unpin_scb()
KVM: arm64: Only apply PMCR_EL0.P to the guest range of counters
KVM: arm64: nv: Reload PMU events upon MDCR_EL2.HPME change
KVM: arm64: Use KVM_REQ_RELOAD_PMU to handle PMCR_EL0.E change
KVM: arm64: Add unified helper for reprogramming counters by mask
KVM: arm64: Always check the state from hyp_ack_unshare()
KVM: arm64: Fix set_id_regs selftest for ASIDBITS becoming unwritable
Merge tag 'perf_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Fix a #GP in the perf user callchain code caused by a race between
uprobe freeing the task and the bpf profiler unwinding the task's
user stack
* tag 'perf_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
uprobes: Fix race in uprobe_free_utask
Merge tag 'x86_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Check whether shadow stack is active before using the ptrace regset
getter
- Remove a wrong BUG_ON in the early static call code which breaks Xen
PVH when booting as dom0
* tag 'x86_urgent_for_v6.13_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Ensure shadow stack is active before "getting" registers
x86/static-call: Remove early_boot_irqs_disabled check to fix Xen PVH dom0
Merge tag 'kvm-s390-master-6.13-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: three small bugfixes
Fix a latent bug when the kernel is compiled in debug mode.
Two small UCONTROL fixes and their selftests.
Merge tag 'kvmarm-fixes-6.13-3' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.13, part #3
- Always check page state in hyp_ack_unshare()
- Align set_id_regs selftest with the fact that ASIDBITS field is RO
- Various vPMU fixes for bugs that only affect nested virt
The new __kvm_faultin_pfn() function is upset by the fact that e500
KVM ignores host page permissions - __kvm_faultin requires a "writable"
outgoing argument, but e500 KVM is passing NULL.
While a simple fix that merely allows writable to be NULL would be
possible, it is quite ugly to have e500 KVM completely ignore the host
permissions and map read-only host pages as guest-writable. Merge a more
complete fix and remove the VMA-based attempts at building huge shadow TLB
entries. Using a PTE lookup, similar to what is done for x86, is better
and works with remap_pfn_range() because it does not assume that VM_PFNMAP
areas are contiguous. Note that the same incorrect logic is there in
ARM's get_vma_page_shift() and RISC-V's kvm_riscv_gstage_ioremap().
Fortunately, for e500 most of the code is already there; it just has to
be changed to compute the range from find_linux_pte()'s output rather
than find_vma(). The new code works for both VM_PFNMAP and hugetlb
mappings, so the latter is removed.
Patches 2-5 were tested by the reporter, Christian Zigotzky. Since
the difference with v1 is minimal, I am going to send it to Linus
today.
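Roughly, the PTE-lookup approach described above looks like this. This is
a sketch only: find_linux_pte() is the real powerpc helper, but the
surrounding locals (pgdir, hva, guest_shift, map_shift) are illustrative.
  pte_t *ptep;
  unsigned int hshift = 0;
  unsigned long map_shift;

  /* walk the host page table; for huge mappings, hshift reports the
   * shift of the host mapping level */
  ptep = find_linux_pte(pgdir, hva, NULL, &hshift);
  if (!ptep)
          goto out;                     /* no host mapping, bail out */
  if (!hshift)
          hshift = PAGE_SHIFT;          /* regular small host page */

  /* largest mapping that satisfies the guest yet stays in one host PTE */
  map_shift = min_t(unsigned long, guest_shift, hshift);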
e500 KVM tries to bypass __kvm_faultin_pfn() in order to map VM_PFNMAP
VMAs as huge pages. This is a Bad Idea because VM_PFNMAP VMAs could
become noncontiguous as a result of calls to remap_pfn_range().
Instead, use the already existing host PTE lookup to retrieve a
valid host-side mapping level after __kvm_faultin_pfn() has
returned. Then find the largest size that will satisfy the
guest's request while staying within a single host PTE.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The new __kvm_faultin_pfn() function is upset by the fact that e500 KVM
ignores host page permissions - __kvm_faultin requires a "writable"
outgoing argument, but e500 KVM is nonchalantly passing NULL.
If the host page permissions do not include writability, the shadow
TLB entry is forcibly mapped read-only.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the possibility of marking a page so that the UW and SW bits are
force-cleared. This is stored in the private info so that it persists
across multiple calls to kvmppc_e500_setup_stlbe.
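A minimal sketch of the idea; MAS3_UW/MAS3_SW are the real e500
user/supervisor write-permission bits, while the E500_TLB_WRITABLE flag
name is used here for illustration:
  /* remember in the private ref that writability must be suppressed */
  if (!writable)
          ref->flags &= ~E500_TLB_WRITABLE;

  /* then, every time the shadow entry is (re)built: */
  if (!(ref->flags & E500_TLB_WRITABLE))
          stlbe->mas7_3 &= ~(MAS3_UW | MAS3_SW);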
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvmppc_e500_ref_setup is returning whether the guest TLB entry is
writable, which is then passed to kvm_release_faultin_page. This makes
little sense
for two reasons: first, because the function sets up the private data for
the page and the return value feels like it has been bolted on the side;
second, because what really matters is whether the _shadow_ TLB entry is
writable. If it is not writable, the page can be released as non-dirty.
Shift from using tlbe_is_writable(gtlbe) to doing the same check on
the shadow TLB entry.
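For reference, tlbe_is_writable() just tests the write-permission bits of
an entry, so the change amounts to pointing it at the shadow entry; a
sketch (the release call site is approximate):
  static inline bool tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
  {
          return tlbe->mas7_3 & (MAS3_SW | MAS3_UW);
  }

  /* release path: report dirtiness based on the shadow entry */
  kvm_release_faultin_page(kvm, page, !!ret, tlbe_is_writable(stlbe));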
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If find_linux_pte fails, IRQs will not be restored. This is unlikely to
happen in practice, since it would have shown up as hanging hosts, but it
should of course be fixed anyway.
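The shape of the fix is the usual single-exit pattern; a sketch, not the
exact hunk:
  local_irq_save(flags);
  ptep = find_linux_pte(pgdir, hva, NULL, &hshift);
  if (!ptep) {
          ret = -EINVAL;
          goto out;             /* used to return here with IRQs off */
  }
  /* ... compute and install the shadow TLB entry ... */
  out:
          local_irq_restore(flags);
          return ret;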
Cc: stable@vger.kernel.org
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The H7606W laptop has almost the exact same codec setup as the GA403
and so the same quirks apply to it.
Signed-off-by: Luke D. Jones <luke@ljones.dev>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20250111022754.177551-2-luke@ljones.dev
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The GA605W laptop has almost the exact same codec setup as the GA403
and so the same quirks apply to it.
Signed-off-by: Luke D. Jones <luke@ljones.dev>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20250111022754.177551-1-luke@ljones.dev
Signed-off-by: Takashi Iwai <tiwai@suse.de>
syzkaller reported such a BUG_ON():
------------[ cut here ]------------
kernel BUG at mm/khugepaged.c:1835!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
...
CPU: 6 UID: 0 PID: 8009 Comm: syz.15.106 Kdump: loaded Tainted: G W 6.13.0-rc6 #22
Tainted: [W]=WARN
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : collapse_file+0xa44/0x1400
lr : collapse_file+0x88/0x1400
sp : ffff80008afe3a60
...
Call trace:
collapse_file+0xa44/0x1400 (P)
hpage_collapse_scan_file+0x278/0x400
madvise_collapse+0x1bc/0x678
madvise_vma_behavior+0x32c/0x448
madvise_walk_vmas.constprop.0+0xbc/0x140
do_madvise.part.0+0xdc/0x2c8
__arm64_sys_madvise+0x68/0x88
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0xc8/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x34/0x128
el0t_64_sync_handler+0xc8/0xd0
el0t_64_sync+0x190/0x198
This indicates that the pgoff is unaligned. After analysis, I confirmed
the vma is mapped to /dev/zero. Such a vma certainly has a vm_file, but
it is set anonymous by mmap_zero(). So even if it is mmapped at a
2MB-unaligned address, it can pass the check in thp_vma_allowable_order()
as an anonymous mmap, but is then collapsed as a file mmap.
The problem has existed for a long time, but the khugepaged_max_ptes_none
check used to make us skip the collapse, since /dev/zero has no present
pages. Commit d8ea7cc8547c limited that check to khugepaged only, so the
BUG_ON() can now be triggered by madvise_collapse().
Add vma_is_anonymous() check to make such vma be processed by
hpage_collapse_scan_pmd().
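A sketch of the resulting dispatch (argument lists abbreviated from
mm/khugepaged.c):
  /* a MAP_PRIVATE mapping of /dev/zero has a vm_file but was made
   * anonymous by mmap_zero(), so it must take the anon path */
  if (vma_is_anonymous(vma))
          result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
  else
          result = hpage_collapse_scan_file(mm, addr, file, pgoff, cc);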
Link: https://lkml.kernel.org/r/20250111034511.2223353-1-liushixin2@huawei.com
Fixes: d8ea7cc8547c ("mm/khugepaged: add flag to predicate khugepaged-only behavior")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fixes an issue where the use of an unsigned data type in
`shmem_parse_opt_casefold()` caused incorrect evaluation of negative
conditions.
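The bug is of the classic unsigned-comparison class; a tiny standalone
illustration (not the shmem code itself):
  #include <stdio.h>

  static int parse_version(void)
  {
          return -22;     /* an -EINVAL-style error */
  }

  int main(void)
  {
          unsigned int v = parse_version();   /* buggy: wraps to 4294967274 */
          if (v < 0)                          /* never true for unsigned */
                  printf("error caught\n");
          else
                  printf("error missed, v=%u\n", v);

          int s = parse_version();            /* fixed: signed type */
          if (s < 0)
                  printf("error caught, s=%d\n", s);
          return 0;
  }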
Link: https://lkml.kernel.org/r/20250111-unsignedcompare1601569-v3-1-c861b4221831@gmail.com
Fixes: 58e55efd6c72 ("tmpfs: Add casefold lookup support")
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Karan Sanghavi <karansanghvi98@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update .mailmap to reflect my new (and final) primary email address,
carlos.bilbao@kernel.org. This ensures consistent attribution in Git
history. Also update my contact information in file
Documentation/translations/sp_SP/index.rst to help contributors reach out
for Spanish translations.
Link: https://lkml.kernel.org/r/20250111161110.862131-1-carlos.bilbao@kernel.org
Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Cc: Avadhut Naik <avadhut.naik@amd.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At least recent gdb releases (seen with 14.2) return SP_EL0 as a signed
long, which lets the right shift always return 0.
Link: https://lkml.kernel.org/r/dcd2fabc-9131-4b48-8419-6444e2d67454@siemens.com
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA
balancing operations") moved the accounting of PGDEMOTE_* statistics to
shrink_inactive_list(). However, shrink_inactive_list() is not called
when lru_gen_enabled is true, leading to incorrect demotion statistics
despite actual demotion events occurring.
Add the PGDEMOTE_* accounting in evict_folios(), ensuring that demotion
statistics are correctly updated regardless of the lru_gen_enabled state.
This fix is crucial for systems that rely on accurate NUMA balancing
metrics for performance tuning and resource management.
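A sketch of the accounting added to evict_folios() (helper names as in
mm/vmscan.c; placement abbreviated):
  /* in evict_folios(), mirroring what shrink_inactive_list() does
   * for the classic LRU */
  item = PGDEMOTE_KSWAPD + reclaimer_offset();
  __count_vm_events(item, stat.nr_demoted);
  __count_memcg_events(lruvec_memcg(lruvec), item, stat.nr_demoted);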
Link: https://lkml.kernel.org/r/20250110122133.423481-2-lizhijian@fujitsu.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a local nr_demoted variable to fix double counting of
nr_reclaimed.
Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In shrink_folio_list(), demote_folio_list() can be called twice.
Currently stat->nr_demoted only stores the result of the last call (the
later nr_demoted is always zero, so the earlier count is lost); as a
result, the number of demoted pages is inaccurate.
Accumulate the nr_demoted count across multiple calls to
demote_folio_list(), ensuring accurate reporting of demotion statistics.
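In spirit the fix is a one-character change:
  /* before: the second call overwrites the count from the first */
  stat->nr_demoted = demote_folio_list(&demote_folios, pgdat);

  /* after: accumulate across both calls in shrink_folio_list() */
  stat->nr_demoted += demote_folio_list(&demote_folios, pgdat);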
Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Tested-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") the
number of softlockups in __read_vmcore at kdump time has gone down, but
softlockups still happen sometimes.
In a memory constrained environment like the kdump image, a softlockup is
not just a harmless message, but it can interfere with things like RCU
freeing memory, causing the crashdump to get stuck.
The second loop in __read_vmcore has a lot more opportunities for natural
sleep points, like scheduling out while waiting for a data write to
happen, but apparently that is not always enough.
Add a cond_resched() to the second loop in __read_vmcore to (hopefully)
get rid of the softlockups.
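A sketch of where the call lands (loop body abbreviated):
  /* second loop in __read_vmcore() */
  while (buflen) {
          /* ... find the vmcore chunk covering *fpos and copy it out ... */
          cond_resched();         /* let RCU and other tasks make progress */
  }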
Link: https://lkml.kernel.org/r/20250110102821.2a37581b@fangorn
Fixes: 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Breno Leitao <leitao@debian.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We only need to assert that the uptodate flag is clear if we're going to
set it. This hasn't been a problem before now because we have only used
folio_end_read() when completing with an error, but it's convenient to use
it in squashfs if we discover the folio is already uptodate.
Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and
pgdemote_khugepaged stats in vmstat are not being updated.
Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA
balancing operations") moved the pgdemote vmstat update from
demote_folio_list() to shrink_inactive_list(), which is in the normal LRU
path. As a result, the pgdemote stats are updated correctly for the
normal LRU but not for MGLRU.
To address this, we have added the pgdemote stat update in the
evict_folios() function, which is in the MGLRU path. With this patch, the
pgdemote stats will now be updated correctly when MGLRU is enabled.
Without this patch vmstat output when MGLRU is enabled
======================================================
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0
With this patch vmstat output when MGLRU is enabled
===================================================
pgdemote_kswapd 43234
pgdemote_direct 4691
pgdemote_khugepaged 0
Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The upstream commit adcfb264c3ed ("vmstat: disable vmstat_work on
vmstat_cpu_down_prep()") introduced another warning during the boot phase,
so it was soon reverted upstream by commit cd6313beaeae ("Revert "vmstat:
disable vmstat_work on vmstat_cpu_down_prep()""). This commit resolves the
warning and reattempts the original fix.
Even after the mm/vmstat:online teardown, shepherd may still queue work
for the dying cpu until the cpu is removed from the online mask. While
this is quite rare, it means that after unbind_workers() unbinds a per-cpu
kworker, the kworker may run vmstat_update for the dying CPU on an
irrelevant cpu before the CPU enters the atomic AP states. When
CONFIG_DEBUG_PREEMPT=y, this results in the following error and backtrace.
BUG: using smp_processor_id() in preemptible [00000000] code: \
kworker/7:3/1702
caller is refresh_cpu_vm_stats+0x235/0x5f0
CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
Tainted: [N]=TEST
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xb0
check_preemption_disabled+0xce/0xe0
refresh_cpu_vm_stats+0x235/0x5f0
vmstat_update+0x17/0xa0
process_one_work+0x869/0x1aa0
worker_thread+0x5e5/0x1100
kthread+0x29e/0x380
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
symmetrically enable it on startup.
For secondary CPUs during CPU hotplug scenarios, ensure the delayed work
is disabled immediately after the initialization. These CPUs are not yet
online when start_shepherd_timer() runs on boot CPU. vmstat_cpu_online()
will enable the work for them.
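A sketch of the symmetric handling, using the upstream disable/enable
delayed-work API; call sites abbreviated from mm/vmstat.c:
  static int vmstat_cpu_down_prep(unsigned int cpu)
  {
          /* reliably stop any shepherd-queued work for the dying CPU */
          disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
          return 0;
  }

  static int vmstat_cpu_online(unsigned int cpu)
  {
          /* ... existing onlining steps ... */
          enable_delayed_work(&per_cpu(vmstat_work, cpu));
          return 0;
  }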
Link: https://lkml.kernel.org/r/20250108042807.3429745-1-koichiro.den@canonical.com
Signed-off-by: Huacai Chen <chenhuacai@kernel.org>
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Suggested-by: Huacai Chen <chenhuacai@kernel.org>
Tested-by: Charalampos Mitrodimas <charmitro@posteo.net>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Using x86_64 as an example, for a 32KB struct page[] area describing a 2MB
hugeTLB, HVO reduces the area to 4KB by the following steps:
1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
by PTE 0, and at the same time change the permission from r/w to
r/o;
3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
to 4KB.
However, the following race can happen due to improper ordering of memory
loads:
CPU 1 (HVO)                      CPU 2 (speculative PFN walker)

page_ref_freeze()
synchronize_rcu()
                                 rcu_read_lock()
                                 page_is_fake_head() is false
vmemmap_remap_pte()
XXX: struct page[] becomes r/o
page_ref_unfreeze()
                                 page_ref_count() is not zero
                                 atomic_add_unless(&page->_refcount)
                                 XXX: try to modify r/o struct page[]
Specifically, page_is_fake_head() must be ordered after page_ref_count()
on CPU 2 so that it can only return true for this case, to avoid the later
attempt to modify r/o struct page[].
This patch adds the missing memory barrier and makes the tests on
page_is_fake_head() and page_ref_count() done in the proper order.
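Illustratively, the walker-side ordering must be as follows (a sketch,
not the exact helper):
  if (!page_ref_count(page))
          return false;           /* frozen by HVO, leave it alone */
  smp_rmb();                      /* order the two reads; pairs with the
                                     freeze/remap sequence on the HVO side */
  if (page_is_fake_head(page))
          return false;           /* struct page may already be r/o */
  /* only now is atomic_add_unless(&page->_refcount, ...) safe to try */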
Link: https://lkml.kernel.org/r/20250108074822.722696-1-yuzhao@google.com
Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Will Deacon <will@kernel.org>
Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If zram_meta_alloc() failed early, it frees the allocated zram->table
without setting it to NULL. This can cause zram_meta_free() to access the
stale table if the user resets the failed, uninitialized device.
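A sketch of the error path with the fix, abridged from the shape of
zram_meta_alloc():
  zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
  if (!zram->table)
          return false;

  zram->mem_pool = zs_create_pool(zram->disk->disk_name);
  if (!zram->mem_pool) {
          vfree(zram->table);
          zram->table = NULL;     /* the fix: no stale pointer for reset */
          return false;
  }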
Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com
Fixes: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit b1f202060afe ("mm: remap unused subpages to shared zeropage
when splitting isolated thp"), cow test cases involving swapping out THPs
via madvise(MADV_PAGEOUT) started to be skipped due to the subsequent
check via pagemap determining that the memory was not actually swapped
out. Logs similar to this were emitted:
...
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (16 kB)
ok 2 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with single PTE of swapped-out THP (16 kB)
ok 3 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (32 kB)
ok 4 # SKIP MADV_PAGEOUT did not work, is swap enabled?
...
The commit in question introduces the behaviour of scanning THPs and if
their content is predominantly zero, it splits them and replaces the pages
which are wholly zero with the zero page. These cow test cases were
getting caught up in this.
So let's avoid that by filling the contents of all allocated memory with
a non-zero value. With this in place, the tests are passing again.
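A sketch of the selftest change (the fill constant is illustrative):
  char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  /* non-zero fill: otherwise the split THP's zero-filled pages are
   * remapped to the shared zeropage and never reach swap */
  memset(mem, 0x11, size);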
Link: https://lkml.kernel.org/r/20250107142555.1870101-1-ryan.roberts@arm.com
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When mremap()ing a memory region previously registered with userfaultfd as
write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in
flag clearing leads to a mismatch between the vma flags (which have
uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp
cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to
trigger a warning in page_table_check_pte_flags() due to setting the pte
to writable while uffd-wp is still set.
Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any
such mremap() so that the values are consistent with the existing clearing
of VM_UFFD_WP. Be careful to clear the logical flag regardless of its
physical form: a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE,
huge PMD and hugetlb paths.
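A sketch for the PTE case; the real patch also covers huge PMDs and
hugetlb with the analogous helpers:
  if (pte_present(pte))
          pte = pte_clear_uffd_wp(pte);
  else
          /* covers both swap PTEs and PTE markers */
          pte = pte_swp_clear_uffd_wp(pte);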
Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com
Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/
Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The virtual_address_range selftest reads from the start of each mapping
listed in /proc/self/maps.
However not all mappings are valid to be arbitrarily accessed. For
example the vvar data used for virtual clocks on x86 can only be accessed
if 1) the kernel configuration enables virtual clocks and 2) the
hypervisor provided the data for it, which can only be determined by the
VDSO code itself.
Since commit e93d2521b27f ("x86/vdso: Split virtual clock pages into
dedicated mapping") the virtual clock data was split out into its own
mapping, triggering faulting accesses by virtual_address_range.
Skip the various vvar mappings in virtual_address_range to avoid errors.
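A sketch of the skip; the matched prefix is illustrative and is meant to
cover both "[vvar]" and the new "[vvar_vclock]" mapping:
  /* while parsing /proc/self/maps lines in the selftest */
  if (strstr(line, "[vvar"))      /* "[vvar]" and "[vvar_vclock]" */
          continue;               /* not safe to read arbitrarily */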
Link: https://lkml.kernel.org/r/20250107-virtual_address_range-tests-v1-2-3834a2fb47fe@linutronix.de
Fixes: e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")
Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412271148.2656e485-lkp@intel.com
Cc: Dev Jain <dev.jain@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If not enough physical memory is available the kernel may fail mmap(); see
__vm_enough_memory() and vm_commit_limit(). In that case the logic in
validate_complete_va_space() does not make sense and will even incorrectly
fail. Instead skip the test if no mmap() succeeded.
Link: https://lkml.kernel.org/r/20250107-virtual_address_range-tests-v1-1-3834a2fb47fe@linutronix.de
Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
current CPU at the beginning of the operation is retrieved and used
throughout. However, since neither preemption nor migration are disabled,
it is possible that the operation continues on a different CPU.
If the original CPU is hotunplugged while the acomp_ctx is still in use,
we run into a UAF bug as some of the resources attached to the acomp_ctx
are freed during hotunplug in zswap_cpu_comp_dead() (i.e.
acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp).
The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use
crypto_acomp API for hardware acceleration") when the switch to the
crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
retrieved using get_cpu_ptr() which disables preemption and makes sure the
CPU cannot go away from under us. Preemption cannot be disabled with the
crypto_acomp API as a sleepable context is needed.
Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating
and freeing resources with compression/decompression paths. Make sure
that acomp_ctx.req is NULL when the resources are freed. In the
compression/decompression paths, check if acomp_ctx.req is NULL after
acquiring the mutex (meaning the CPU was offlined) and retry on the new
CPU.
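A sketch of the resulting lock-and-recheck loop in the [de]compression
path; names follow the zswap per-CPU context, but this is a sketch, not
the exact hunk:
  for (;;) {
          acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
          mutex_lock(&acomp_ctx->mutex);
          if (likely(acomp_ctx->req))
                  break;          /* CPU still online, resources valid */
          /* zswap_cpu_comp_dead() freed the resources under the mutex;
           * retry, picking up the (possibly new) current CPU's context */
          mutex_unlock(&acomp_ctx->mutex);
  }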
The initialization of acomp_ctx.mutex is moved from the CPU hotplug
callback to the pool initialization where it belongs (where the mutex is
allocated). In addition to adding clarity, this makes sure that CPU
hotplug cannot reinitialize a mutex that is already locked by
compression/decompression.
Previously a fix was attempted by holding cpus_read_lock() [1]. This
would have caused a potential deadlock as it is possible for code already
holding the lock to fall into reclaim and enter zswap (causing a
deadlock). A fix was also attempted using SRCU for synchronization, but
Johannes pointed out that synchronize_srcu() cannot be used in CPU hotplug
notifiers [2].
Alternative fixes that were considered/attempted and could have worked:
- Refcounting the per-CPU acomp_ctx. This involves complexity in
handling the race between the refcount dropping to zero in
zswap_[de]compress() and the refcount being re-initialized when the
CPU is onlined.
- Disabling migration before getting the per-CPU acomp_ctx [3], but
that's discouraged and is a much bigger hammer than needed, and could
result in subtle performance issues.
[1]https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@google.com/
[2]https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@google.com/
[3]https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@google.com/
Link: https://lkml.kernel.org/r/20250108222441.3622031-1-yosryahmed@google.com
Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/
Reported-by: Sam Sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hugetlb_file_setup() will pass a NULL @dir to hugetlbfs_get_inode(), so we
will access a NULL pointer for @dir. Fix it and set __entry->dr to 0 if
@dir is NULL. Because ->i_ino cannot be 0 (see get_next_ino()), there is
no confusion if the user sees a 0 inode number.
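The fix is a guarded assignment in the tracepoint, roughly:
  __entry->dr = dir ? dir->i_ino : 0;   /* 0 is never a real inode number */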
Link: https://lkml.kernel.org/r/20250106033118.4640-1-songmuchun@bytedance.com
Fixes: 318580ad7f28 ("hugetlbfs: support tracepoint")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Cheung Wall <zzqq0103.hey@gmail.com>
Closes: https://lore.kernel.org/linux-mm/02858D60-43C1-4863-A84F-3C76A8AF1F15@linux.dev/T/#
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Cc: cheung wall <zzqq0103.hey@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During testing it was detected that it is possible to get a div by zero
error in bdi_set_min_bytes. The error is caused by bdi_ratio_from_pages(),
which calls global_dirty_limits. If the dirty threshold is 0, the div by
zero is raised. This can happen if the root user is setting:
echo 0 > /proc/sys/vm/dirty_ratio
The following is a test case:
echo 0 > /proc/sys/vm/dirty_ratio
cd /sys/class/bdi/<device>
echo 1 > strict_limit
echo 8192 > min_bytes
==> error is raised.
The problem is addressed by returning -EINVAL if dirty_ratio or
dirty_bytes is set to 0.
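A sketch of the check; global_dirty_limits() is the real helper, and the
rest of the body is abbreviated:
  static int bdi_ratio_from_pages(unsigned long pages)
  {
          unsigned long background_thresh, dirty_thresh;

          global_dirty_limits(&background_thresh, &dirty_thresh);
          if (!dirty_thresh)
                  return -EINVAL; /* dirty_ratio/dirty_bytes set to 0 */
          /* ... otherwise compute and return the ratio as before ... */
  }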
Link: https://lkml.kernel.org/r/20250109063411.6591-1-shr@devkernel.io
Reported-by: cheung wall <zzqq0103.hey@gmail.com>
Closes: https://lore.kernel.org/linux-mm/87pll35yd0.fsf@devkernel.io/T/#t
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During testing it was detected that it is possible to get a div by zero
error in bdi_set_min_bytes. The error is caused by bdi_ratio_from_pages(),
which calls global_dirty_limits.
If the dirty threshold is 0, the div by zero is raised. This can happen
if the root user is setting:
echo 0 > /proc/sys/vm/dirty_ratio
The following is a test case:
echo 0 > /proc/sys/vm/dirty_ratio
cd /sys/class/bdi/<device>
echo 1 > strict_limit
echo 8192 > min_bytes
==> error is raised.
The problem is addressed by returning -EINVAL if dirty_ratio or
dirty_bytes is set to 0.
Link: https://lkml.kernel.org/r/20250104012037.159386-1-shr@devkernel.io
Reported-by: cheung wall <zzqq0103.hey@gmail.com>
Closes: https://lore.kernel.org/linux-mm/87pll35yd0.fsf@devkernel.io/T/#t
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Qiang Zhang <zzqq0103.hey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The recently introduced ROX cache for modules is assuming large page
support in 64-bit mode without testing the related feature bit. This
results in breakage when running as a Xen PV guest, as in this mode large
pages are not supported.
Fix that by testing the X86_FEATURE_PSE capability when deciding whether
to enable the ROX cache.
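A sketch of the gate; cpu_feature_enabled() is the real x86 helper:
  /* early in the execmem ROX cache setup */
  if (!cpu_feature_enabled(X86_FEATURE_PSE))
          return;         /* no large pages (e.g. Xen PV): skip the cache */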
Link: https://lkml.kernel.org/r/20250103065631.26459-1-jgross@suse.com
Fixes: 2e45474ab14f ("execmem: add support for cache of large ROX pages")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a
64-bit value to 32 bits, leading to a possible infinite loop when writing
to an xfs filesystem.
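A tiny standalone illustration of the truncation (not the iomap code
itself):
  #include <stdio.h>

  int main(void)
  {
          long long pos = 0x100000000LL;  /* 4 GiB, a valid 64-bit loff_t */
          unsigned int offset = pos;      /* 32-bit type: truncates to 0 */

          printf("pos=%lld offset=%u\n", pos, offset);
          /* a seek loop that advances "offset" instead of "pos" never
           * gets past 4 GiB, hence the possible infinite loop */
          return 0;
  }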
Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com
Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data")
Signed-off-by: Marco Nelissen <marco.nelissen@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>