linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2024-12-28 00:32:00 +00:00

Author	SHA1	Message	Date
Stephen Rothwell	24896579b9	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git	2024-12-20 15:11:58 +11:00
Stephen Rothwell	7b5a9f355d	Merge branch 'slab/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git	2024-12-20 15:11:19 +11:00
Stephen Rothwell	ea2356f802	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git	2024-12-20 13:32:50 +11:00
Stephen Rothwell	07ccd1271c	Merge branch 'fs-next' of linux-next	2024-12-20 10:45:33 +11:00
Stephen Rothwell	cd07c43f9b	Merge branch 'vfs.all' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git	2024-12-20 09:19:26 +11:00
Stephen Rothwell	8dad5129f0	Merge branch 'for_next' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git	2024-12-20 09:19:18 +11:00
Ingo Molnar	08eccca432	Merge branch into tip/master: 'perf/core' # New commits in perf/core: `02c56362a7` ("uprobes: Guard against kmemdup() failing in dup_return_instance()") `d29e744c71` ("perf/x86: Relax privilege filter restriction on AMD IBS") `6057b90ecc` ("perf/core: Export perf_exclude_event()") `8622e45b5d` ("uprobes: Reuse return_instances between multiple uretprobes within task") `0cf981de76` ("uprobes: Ensure return_instance is detached from the list before freeing") `636666a1c7` ("uprobes: Decouple return_instance list traversal and freeing") `2ff913ab3f` ("uprobes: Simplify session consumer tracking") `e0925f2dc4` ("uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution") `83e3dc9a5d` ("uprobes: simplify find_active_uprobe_rcu() VMA checks") `03a001b156` ("mm: introduce mmap_lock_speculate_{try_begin\|retry}") `eb449bd969` ("mm: convert mm_lock_seq to a proper seqcount") `7528585290` ("mm/gup: Use raw_seqcount_try_begin()") `96450ead16` ("seqlock: add raw_seqcount_try_begin") `b4943b8bfc` ("perf/x86/rapl: Add core energy counter support for AMD CPUs") `54d2759778` ("perf/x86/rapl: Move the cntr_mask to rapl_pmus struct") `bdc57ec705` ("perf/x86/rapl: Remove the global variable rapl_msrs") `abf03d9bd2` ("perf/x86/rapl: Modify the generic variable names to _pkg") `eeca4c6b25` ("perf/x86/rapl: Add arguments to the init and cleanup functions") `cd29d83a6d` ("perf/x86/rapl: Make rapl_model struct global") `8bf1c86e5a` ("perf/x86/rapl: Rename rapl_pmu variables") `1d5e2f637a` ("perf/x86/rapl: Remove the cpu_to_rapl_pmu() function") `e4b4443477` ("x86/topology: Introduce topology_logical_core_id()") `2f2db34707` ("perf/x86/rapl: Remove the unused get_rapl_pmu_cpumask() function") `ae55e308bd` ("perf/x86/intel/ds: Simplify the PEBS records processing for adaptive PEBS") `3c00ed344c` ("perf/x86/intel/ds: Factor out functions for PEBS records processing") `7087bfb0ad` ("perf/x86/intel/ds: Clarify adaptive PEBS processing") `faac6f105e` ("perf/core: Check sample_type in perf_sample_save_brstack") `f226805bc5` ("perf/core: Check sample_type in perf_sample_save_callchain") `b9c44b9147` ("perf/core: Save raw sample data conditionally based on sample type") Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-12-19 20:24:25 +01:00
Andrew Morton	45f41efd96	foo	2024-12-18 19:51:48 -08:00
Easwar Hariharan	3b44460d5e	mm: kmemleak: convert timeouts to secs_to_jiffies() Commit `b35108a51c` ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @@ constant C; @@ - msecs_to_jiffies(C * 1000) + secs_to_jiffies(C) @@ constant C; @@ - msecs_to_jiffies(C * MSEC_PER_SEC) + secs_to_jiffies(C) Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-6-ddfefd7e9f2a@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Daniel Mack <daniel@zonque.org> Cc: David Airlie <airlied@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dick Kennedy <dick.kennedy@broadcom.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: James Smart <james.smart@broadcom.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jeff Johnson <jjohnson@kernel.org> Cc: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jeroen de Borst <jeroendb@google.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kalle Valo <kvalo@kernel.org> Cc: Louis Peens <louis.peens@corigine.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Ofir Bitton <obitton@habana.ai> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Praveen Kaligineedi <pkaligineedi@google.com> Cc: Ray Jui <rjui@broadcom.com> Cc: Robert Jarzmik <robert.jarzmik@free.fr> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Scott Branden <sbranden@broadcom.com> Cc: Shailend Chand <shailend@google.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Simon Horman <horms@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:35 -08:00
Andrew Morton	5555a83c82	replace-free-hugepage-folios-after-migration-fix fix comments, 80-column tweak Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: yangge <yangge1116@126.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:11 -08:00
yangge	7e5670da45	mm: replace free hugepage folios after migration My machine has 4 NUMA nodes, each equipped with 32GB of memory. I have configured each NUMA node with 16GB of CMA and 16GB of in-use hugetlb pages. The allocation of contiguous memory via cma_alloc() can fail probabilistically. cma_alloc() may fail if it sees an in-use hugetlb page within the allocation range, even if that page has already been migrated. When in-use hugetlb pages are migrated, they may simply be released back into the free hugepage pool instead of being returned to the buddy system. This can cause test_pages_isolated() check to fail, ultimately leading to the failure of cma_alloc(): cma_alloc() __alloc_contig_migrate_range() // migrate in-use hugepage test_pages_isolated() __test_page_isolated_in_pageblock() PageBuddy(page) // check if the page is in buddy To address this issue, we add a function named replace_free_hugepage_folios(). This function will replace the hugepage in the free hugepage pool with a new one and release the old one to the buddy system. After the migration of in-use hugetlb pages is completed, we will invoke replace_free_hugepage_folios() to ensure that these hugepages are properly released to the buddy system. Following this step, when test_pages_isolated() is executed for inspection, it will successfully pass. Link: https://lkml.kernel.org/r/1734503588-16254-1-git-send-email-yangge1116@126.com Signed-off-by: yangge <yangge1116@126.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:11 -08:00
Kairui Song	04940358e8	mm/swap_cgroup: decouple swap cgroup recording and clearing The current implementation of swap cgroup tracking is a bit complex and fragile: On charging path, swap_cgroup_record always records an actual memcg id, and it depends on the caller to make sure all entries passed in must belong to one single folio. As folios are always charged or uncharged as a whole, and always charged and uncharged in order, swap_cgroup doesn't need an extra lock. On uncharging path, swap_cgroup_record always sets the record to zero. These entries won't be charged again until uncharging is done. So there is no extra lock needed either. Worth noting that swap cgroup clearing may happen without folio involved, eg. exiting processes will zap its page table without swapin. The xchg/cmpxchg provides atomic operations and barriers to ensure no tearing or synchronization issue of these swap cgroup records. It works but quite error-prone. Things can be much clear and robust by decoupling recording and clearing into two helpers. Recording takes the actual folio being charged as argument, and clearing always set the record to zero, and refine the debug sanity checks to better reflect their usage Benchmark even showed a very slight improvement as it saved some extra arguments and lookups: make -j96 with defconfig on tmpfs in 1.5G memory cgroup using 4k folios: Before: sys 9617.23 (stdev 37.764062) After : sys 9541.54 (stdev 42.973976) make -j96 with defconfig on tmpfs in 2G memory cgroup using 64k folios: Before: sys 7358.98 (stdev 54.927593) After : sys 7337.82 (stdev 39.398956) Link: https://lkml.kernel.org/r/20241218114633.85196-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:09 -08:00
Kairui Song	4a5fca208e	mm/swap_cgroup: remove global swap cgroup lock commit `e9e58a4ec3` ("memcg: avoid use cmpxchg in swap cgroup maintainance") replaced the cmpxchg/xchg with a global irq spinlock because some archs doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well. And as commented in swap_cgroup.c, this lock is not needed for map synchronization. Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement it to get rid of this lock. Introduced two helpers for doing so and they can be easily dropped if a generic 2 byte xchg is support. Testing using 64G brd and build with build kernel with make -j96 in 1.5G memory cgroup using 4k folios showed below improvement (6 test run): Before this series: Sys time: 10782.29 (stdev 42.353886) Real time: 171.49 (stdev 0.595541) After this commit: Sys time: 9617.23 (stdev 37.764062), -10.81% Real time: 159.65 (stdev 0.587388), -6.90% With 64k folios and 2G memcg: Before this series: Sys time: 8176.94 (stdev 26.414712) Real time: 141.98 (stdev 0.797382) After this commit: Sys time: 7358.98 (stdev 54.927593), -10.00% Real time: 134.07 (stdev 0.757463), -5.57% Sequential swapout of 8G 64k zero folios with madvise (24 test run): Before this series: 5461409.12 us (stdev 183957.827084) After this commit: 5420447.26 us (stdev 196419.240317) Sequential swapin of 8G 4k zero folios (24 test run): Before this series: 19736958.916667 us (stdev 189027.246676) After this commit: 19662182.629630 us (stdev 172717.640614) Performance is better or at least not worse for all tests above. Link: https://lkml.kernel.org/r/20241218114633.85196-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:08 -08:00
Kairui Song	38fdedd883	mm/swap_cgroup: remove swap_cgroup_cmpxchg This function is never used after commit `6b611388b6` ("memcg-v1: remove charge move code"). Link: https://lkml.kernel.org/r/20241218114633.85196-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:08 -08:00
Kairui Song	20214ad102	mm, memcontrol: avoid duplicated memcg enable check Patch series "mm/swap_cgroup: remove global swap cgroup lock", v3. This series removes the global swap cgroup lock. The critical section of this lock is very short but it's still a bottle neck for mass parallel swap workloads. Up to 10% performance gain for tmpfs build kernel test on a 48c96t system under memory pressure, and no regression for other cases: This patch (of 3): mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check, so the caller doesn't need to check that. Link: https://lkml.kernel.org/r/20241218114633.85196-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20241218114633.85196-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:08 -08:00
Thomas Weißschuh	9eb6b0801a	mm/page_idle: constify 'struct bin_attribute' The sysfs core now allows instances of 'struct bin_attribute' to be moved into read-only memory. Make use of that to protect them against accidental or malicious modifications. Link: https://lkml.kernel.org/r/20241216-sysfs-const-bin_attr-page_idle-v1-1-cc01ecc55196@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:07 -08:00
Andrew Morton	23b8261fa8	mm/huge_memory.c: rename shadowed local split_huge_pages_write() has a lccal `buf' which shadows incoming arg `buf'. Reviewer confusion resulted. Rename the inner local to `tok_buf'. Cc: Leo Stone <leocstone@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:07 -08:00
Joanne Koong	2cdf65eaaa	mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings For migrations called in MIGRATE_SYNC mode, skip migrating the folio if it is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its mapping. If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping, the writeback may take an indeterminate amount of time to complete, and waits may get stuck. Link: https://lkml.kernel.org/r/20241122232359.429647-5-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Bernd Schubert <bernd.schubert@fastmail.fm> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:06 -08:00
Joanne Koong	26942ed372	mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Currently in shrink_folio_list(), reclaim for folios under writeback falls into 3 different cases: 1) Reclaim is encountering an excessive number of folios under writeback and this folio has both the writeback and reclaim flags set 2) Dirty throttling is enabled (this happens if reclaim through cgroup is not enabled, if reclaim through cgroupv2 memcg is enabled, or if reclaim is on the root cgroup), or if the folio is not marked for immediate reclaim, or if the caller does not have __GFP_FS (or __GFP_IO if it's going to swap) set 3) Legacy cgroupv1 encounters a folio that already has the reclaim flag set and the caller did not have __GFP_FS (or __GFP_IO if swap) set In cases 1) and 2), we activate the folio and skip reclaiming it while in case 3), we wait for writeback to finish on the folio and then try to reclaim the folio again. In case 3, we wait on writeback because cgroupv1 does not have dirty folio throttling, as such this is a mitigation against the case where there are too many folios in writeback with nothing else to reclaim. For filesystems where writeback may take an indeterminate amount of time to write to disk, this has the possibility of stalling reclaim. In this commit, if legacy memcg encounters a folio with the reclaim flag set (eg case 3) and the folio belongs to a mapping that has the AS_WRITEBACK_INDETERMINATE flag set, the folio will be activated and skip reclaim (eg default to behavior in case 2) instead. Link: https://lkml.kernel.org/r/20241122232359.429647-3-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Bernd Schubert <bernd.schubert@fastmail.fm> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:06 -08:00
Christoph Hellwig	969a9f8a78	mm: unexport apply_to_existing_page_range apply_to_existing_page_range() is only used by non-modular code. Link: https://lkml.kernel.org/r/20241212073423.1439954-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:05 -08:00
Andrew Morton	a6d825ca13	mm-fix-outdated-incorrect-code-comments-for-handle_mm_fault-fix s/mmap_Lock/mmap_lock/, per Liam Cc: Jinliang Zheng <alexjlzheng@gmail.com> Cc: Jinliang Zheng <alexjlzheng@tencent.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:05 -08:00
Jinliang Zheng	e450da5cb2	mm: fix outdated incorrect code comments for handle_mm_fault() Link: https://lkml.kernel.org/r/20241213031820.778342-1-alexjlzheng@tencent.com Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:05 -08:00
Guo Weikang	d60b6c803c	mm/early_ioremap: add null pointer checks to prevent NULL-pointer dereference The early_ioremap interface can fail and return NULL in certain cases. To prevent NULL-pointer dereference crashes, fixed issues in the acpi_extlog and copy_early_mem interfaces, improving robustness when handling early memory. Link: https://lkml.kernel.org/r/20241212101004.1544070-1-guoweikang.kernel@gmail.com Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Julian Stecklina <julian.stecklina@cyberus-technology.de> Cc: Kevin Loughlin <kevinloughlin@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:03 -08:00
Lorenzo Stoakes	2590d8e8f8	mm: add comments to do_mmap(), mmap_region() and vm_mmap() It isn't always entirely clear to users the difference between do_mmap(), mmap_region() and vm_mmap(), so add comments to clarify what's going on in each. This is compounded by the fact that we actually allow callers external to mm to invoke both do_mmap() and mmap_region() (!), the latter of which is really strictly speaking an internal memory mapping implementation detail. Link: https://lkml.kernel.org/r/20241212113152.28849-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:03 -08:00
Lorenzo Stoakes	92743a959d	mm: assert mmap write lock held on do_mmap(), mmap_region() Both of these functions can be invoked outside of mm, so it is probably a good idea to assert that the required lock is held. Will only have an impact if CONFIG_DEBUG_VM is set, otherwise this amounts to no change at all. Link: https://lkml.kernel.org/r/20241212114841.55185-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:02 -08:00
Joshua Hahn	9ebaae011c	memcg/hugetlb: remove memcg hugetlb try-commit-cancel protocol This patch fully removes the mem_cgroup_{try, commit, cancel}_charge functions, as well as their hugetlb variants. Link: https://lkml.kernel.org/r/20241211203951.764733-4-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:02 -08:00
Joshua Hahn	abfe537a39	memcg/hugetlb: introduce mem_cgroup_charge_hugetlb This patch introduces mem_cgroup_charge_hugetlb which combines the logic of mem_cgroup_hugetlb_try_charge / mem_cgroup_hugetlb_commit_charge and removes the need for mem_cgroup_hugetlb_cancel_charge. It also reduces the footprint of memcg in hugetlb code and consolidates all memcg related error paths into one. Link: https://lkml.kernel.org/r/20241211203951.764733-3-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:02 -08:00
Joshua Hahn	fbd3462b74	memcg/hugetlb: introduce memcg_accounts_hugetlb Patch series "memcg/hugetlb: Rework memcg hugetlb charging", v3. This series cleans up memcg's hugetlb charging logic by deprecating the current memcg hugetlb try-charge + {commit, cancel} logic present in alloc_hugetlb_folio. A single function mem_cgroup_charge_hugetlb takes its place instead. This makes the code more maintainable by simplifying the error path and reduces memcg's footprint in hugetlb logic. This patch introduces a few changes in the hugetlb folio allocation error path: (a) Instead of having multiple return points, we consolidate them to two: one for reaching the memcg limit or running out of memory (-ENOMEM) and one for hugetlb allocation fails / limit being reached (-ENOSPC). (b) Previously, the memcg limit was checked before the folio is acquired, meaning the hugeTLB folio isn't acquired if the limit is reached. This patch performs the charging after the folio is reached, meaning if memcg's limit is reached, the acquired folio is freed right away. This patch builds on two earlier patch series: [2] which adds memcg hugeTLB counters, and [3] which deprecates charge moving and removes the last references to mem_cgroup_cancel_charge. The request for this cleanup can be found in [2]. [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/ [2] https://lore.kernel.org/all/20241101204402.1885383-1-joshua.hahnjy@gmail.com/ [3] https://lore.kernel.org/linux-mm/20241025012304.2473312-1-shakeel.butt@linux.dev/ This patch (of 3): This patch isolates the check for whether memcg accounts hugetlb. This condition can only be true if the memcg mount option memory_hugetlb_accounting is on, which includes hugetlb usage in memory.current. Link: https://lkml.kernel.org/r/20241211203951.764733-1-joshua.hahnjy@gmail.com Link: https://lkml.kernel.org/r/20241211203951.764733-2-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:01 -08:00
Hyeonggon Yoo	b126895586	mm/migrate: remove slab checks in isolate_movable_page() Commit `8b8817630a` ("mm/migrate: make isolate_movable_page() skip slab pages") introduced slab checks to prevent mis-identification of slab pages as movable kernel pages. However, after Matthew's frozen folio series, these slab checks became unnecessary as the migration logic fails to increase the reference count for frozen slab folios. Remove these redundant slab checks and associated memory barriers. Link: https://lkml.kernel.org/r/20241210124807.8584-1-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:01 -08:00
Yu Zhao	88b2a8d4ad	mm/mglru: rework workingset protection With the aging feedback no longer considering the distribution of folios in each generation, rework workingset protection to better distribute folios across MAX_NR_GENS. This is achieved by reusing PG_workingset and PG_referenced/LRU_REFS_FLAGS in a slightly different way. For folios accessed multiple times through file descriptors, make lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in folio->flags after PG_referenced, then PG_workingset after LRU_REFS_WIDTH. After all its bits are set, i.e., LRU_REFS_FLAGS\|BIT(PG_workingset), a folio is lazily promoted into the second oldest generation in the eviction path. And when folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that lru_gen_inc_refs() can start over. For this case, LRU_REFS_MASK is only valid when PG_referenced is set. For folios accessed multiple times through page tables, folio_update_gen() from a page table walk or lru_gen_set_refs() from a rmap walk sets PG_referenced after the accessed bit is cleared for the first time. Thereafter, those two paths set PG_workingset and promote folios to the youngest generation. Like folio_inc_gen(), when folio_update_gen() does that, it also clears PG_referenced. For this case, LRU_REFS_MASK is not used. For both of the cases, after PG_workingset is set on a folio, it remains until this folio is either reclaimed, or "deactivated" by lru_gen_clear_refs(). It can be set again if lru_gen_test_recent() returns true upon a refault. When adding folios to the LRU lists, lru_gen_distance() distributes them as follows: +---------------------------------+---------------------------------+ \| Accessed thru page tables \| Accessed thru file descriptors \| +---------------------------------+---------------------------------+ \| PG_active (set while isolated) \| \| +----------------+----------------+----------------+----------------+ \| PG_workingset \| PG_referenced \| PG_workingset \| LRU_REFS_FLAGS \| +---------------------------------+---------------------------------+ \|<--------- MIN_NR_GENS --------->\| \| \|<-------------------------- MAX_NR_GENS -------------------------->\| After this patch, some typical client and server workloads showed improvements under heavy memory pressure. For example, Python TPC-C, which was used to benchmark a different approach [1] to better detect refault distances, showed a significant decrease in total refaults: Before After Change Time (seconds) 10801 10801 0% Executed (transactions) 41472 43663 +5% workingset_nodes 109070 120244 +10% workingset_refault_anon 5019627 7281831 +45% workingset_refault_file 1294678786 554855564 -57% workingset_refault_total 1299698413 562137395 -57% [1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@gmail.com/ Link: https://lkml.kernel.org/r/20241207221522.2250311-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Kairui Song <kasong@tencent.com> Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/ Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Bharata B Rao <bharata@amd.com> Cc: David Stevens <stevensd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:59 -08:00
Yu Zhao	1bef50327e	mm/mglru: rework refault detection With anon and file min_seq being able to move independently, rework workingset protection as well so that the comparison of refaults between anon and file is always on an equal footing. Specifically, make lru_gen_test_recent() return true for refaults happening within the distance of MAX_NR_GENS. For example, if min_seq of a type is max_seq-MIN_NR_GENS, refaults from min_seq-1, i.e., max_seq-MIN_NR_GENS-1, are also considered recent, since the distance max_seq-(max_seq-MIN_NR_GENS-1), i.e., MIN_NR_GENS+1 is less than MAX_NR_GENS. As an intermediate step to the final optimization, this change by itself should not have userspace-visiable effects beyond performance. Link: https://lkml.kernel.org/r/20241207221522.2250311-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Kairui Song <kasong@tencent.com> Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/ Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Bharata B Rao <bharata@amd.com> Cc: David Stevens <stevensd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:59 -08:00
Yu Zhao	51793e247b	mm/mglru: rework type selection With anon and file min_seq being able to move independently, rework type selection so that it is based on the total refaults from all tiers of each type. Also allow a type to be selected until that type reaches MIN_NR_GENS, and therefore abs_diff(min_seq[0],min_seq[1]) now can be 2 (MAX_NR_GENS-MIN_NR_GENS) instead of 1. Since some tiers of a selected type can have higher refaults than the first tier of the other type, use a less larger gain factor 2:3 instead of 1:2, in order for those tiers in the selected type to be better protected. As an intermediate step to the final optimization, this change by itself should not have userspace-visiable effects beyond performance. Link: https://lkml.kernel.org/r/20241207221522.2250311-5-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: David Stevens <stevensd@chromium.org> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:59 -08:00
Yu Zhao	258403f4bd	mm/mglru: rework aging feedback The aging feedback is based on both the number of generations and the distribution of folios in each generation. The number of generations is currently the distance between max_seq and anon min_seq. This is because anon min_seq is not allowed to move past file min_seq. The rationale for that is that file is always evictable whereas anon is not. However, for use cases where anon is a lot cheaper than file: 1. Anon in the second oldest generation can be a better choice than file in the oldest generation. 2. A large amount of file in the oldest generation can skew the distribution, making should_run_aging() return false negative. Allow anon and file min_seq to move independently, and use solely the number of generations as the feedback for aging. Specifically, when both anon and file are evictable, anon min_seq can now be greater than file min_seq, and therefore the number of generations becomes the distance between max_seq and min(min_seq[0],min_seq[1]). And should_run_aging() returns true if and only if the number of generations is less than MAX_NR_GENS. As the first step to the final optimization, this change by itself should not have userspace-visiable effects beyond performance. The next twos patch will take advantage of this change; the last patch in this series will better distribute folios across MAX_NR_GENS. Link: https://lkml.kernel.org/r/20241207221522.2250311-4-yuzhao@google.com Reported-by: David Stevens <stevensd@chromium.org> Signed-off-by: Yu Zhao <yuzhao@google.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:58 -08:00
Yu Zhao	e0003912bc	mm/mglru: optimize deactivation Do not shuffle a folio in the deactivation paths if it is already in the oldest generation. This reduces the LRU lock contention. Before this patch, the contention is reproducible by FIO, e.g., fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \ -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync \ -bs=4k -numjobs=400 -runtime=25000 --time_based \ -group_reporting -name=mglru 98.96%--_raw_spin_lock_irqsave folio_lruvec_lock_irqsave \| --98.78%--folio_batch_move_lru \| --98.63%--deactivate_file_folio mapping_try_invalidate invalidate_mapping_pages invalidate_bdev blkdev_common_ioctl blkdev_ioctl After this patch, deactivate_file_folio() bails out early without taking the LRU lock. A side effect is that a folio can be left at the head of the oldest generation, rather than the tail. If reclaim happens at the same time, it cannot reclaim this folio immediately. Since there is no known correlation between truncation and reclaim, this side effect is considered insignificant. Link: https://lkml.kernel.org/r/20241207221522.2250311-3-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Bharata B Rao <bharata@amd.com> Closes: https://lore.kernel.org/CAOUHufawNerxqLm7L9Yywp3HJFiYVrYO26ePUb1jH-qxNGWzyA@mail.gmail.com/ Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: David Stevens <stevensd@chromium.org> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:58 -08:00
Yu Zhao	fe09d21235	mm/mglru: clean up workingset Patch series "mm/mglru: performance optimizations", v3. This series improves performance for some previously reported test cases. Most of the code changes gathered here has been floating on the mailing list [1][2]. They are now properly organized and have gone through various benchmarks on client and server devices, including Android, FIO, memcached, multiple VMs and MongoDB. In addition to the warning [3] fixed in v2, this version fixes another warning [4] reported by syzbot. [1] https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/ [2] https://lore.kernel.org/CAOUHufawNerxqLm7L9Yywp3HJFiYVrYO26ePUb1jH-qxNGWzyA@mail.gmail.com/ [3] https://lore.kernel.org/67294349.050a0220.701a.0010.GAE@google.com/ [4] https://lore.kernel.org/67549eca.050a0220.2477f.001b.GAE@google.com/ This patch (of 6): Move VM_BUG_ON_FOLIO() to cover both the default and MGLRU paths. Also use a pair of rcu_read_lock() and rcu_read_unlock() within each path, to improve readability. This change should not have any side effects. Link: https://lkml.kernel.org/r/20241207221522.2250311-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20241207221522.2250311-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Bharata B Rao <bharata@amd.com> Cc: David Stevens <stevensd@chromium.org> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:58 -08:00
David Hildenbrand	ba9d57e691	mm/memory_hotplug: don't use __GFP_HARDWALL when migrating pages via memory offlining We'll migrate pages allocated by other context; respecting the cpuset of the memory offlining context when allocating a migration target does not make sense. Drop the __GFP_HARDWALL by using GFP_KERNEL. Note that in an ideal world, migration code could figure out the cpuset of the original context and take that into consideration. Link: https://lkml.kernel.org/r/20241205090508.2095225-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:54 -08:00
David Hildenbrand	d235a15489	mm/page_alloc: don't use __GFP_HARDWALL when migrating pages via alloc_contig() Patch series "mm: don't use __GFP_HARDWALL when migrating remote pages". __GFP_HARDWALL means that we will be respecting the cpuset of the caller when allocating a page. However, when we are migrating remote allocations (pages allocated from other context), the cpuset of the current context is irrelevant. For memory offlining + alloc_contig_(), this is rather obvious. There might be other such page migration users, let's start with the obvious ones. This patch (of 2): We'll migrate pages allocated by other contexts; respecting the cpuset of the alloc_contig*() caller when allocating a migration target does not make sense. Drop the __GFP_HARDWALL. Note that in an ideal world, migration code could figure out the cpuset of the original context and take that into consideration. Link: https://lkml.kernel.org/r/20241205090508.2095225-1-david@redhat.com Link: https://lkml.kernel.org/r/20241205090508.2095225-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:53 -08:00
Jeff Xu	6ca1836d63	mseal: remove can_do_mseal() No code logic change. can_do_mseal() is called exclusively by mseal.c, and mseal.c is compiled only when CONFIG_64BIT flag is set in makefile. Therefore, it is unnecessary to have 32 bit stub function in the header file, remove this function and merge the logic into do_mseal(). Link: https://lkml.kernel.org/r/20241206013934.2782793-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20241206194839.3030596-2-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:52 -08:00
Guillaume Morin	cdd26c3a34	mm/hugetlb: support FOLL_FORCE\|FOLL_WRITE Eric reported that PTRACE_POKETEXT fails when applications use hugetlb for mapping text using huge pages. Before commit `1d8d14641f` ("mm/hugetlb: support write-faults in shared mappings"), PTRACE_POKETEXT worked by accident, but it was buggy and silently ended up mapping pages writable into the page tables even though VM_WRITE was not set. In general, FOLL_FORCE\|FOLL_WRITE does currently not work with hugetlb. Let's implement FOLL_FORCE\|FOLL_WRITE properly for hugetlb, such that what used to work in the past by accident now properly works, allowing applications using hugetlb for text etc. to get properly debugged. This change might also be required to implement uprobes support for hugetlb [1]. [1] https://lore.kernel.org/lkml/ZiK50qob9yl5e0Xz@bender.morinfr.org/ Link: https://lkml.kernel.org/r/Z1NshNfWuzUCPebA@bender.morinfr.org Signed-off-by: Guillaume Morin <guillaume@morinfr.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Hagberg <ehagberg@janestreet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:52 -08:00
Lorenzo Stoakes	567308fe23	mm: perform all memfd seal checks in a single place We no longer actually need to perform these checks in the f_op->mmap() hook any longer. We already moved the operation which clears VM_MAYWRITE on a read-only mapping of a write-sealed memfd in order to work around the restrictions imposed by commit `5de195060b` ("mm: resolve faulty mmap_region() error path behaviour"). There is no reason for us not to simply go ahead and additionally check to see if any pre-existing seals are in place here rather than defer this to the f_op->mmap() hook. By doing this we remove more logic from shmem_mmap() which doesn't belong there, as well as doing the same for hugetlbfs_file_mmap(). We also remove dubious shared logic in mm.h which simply does not belong there either. It makes sense to do these checks at the earliest opportunity, we know these are shmem (or hugetlbfs) mappings whose relevant VMA flags will not change from the invoking do_mmap() so there is simply no need to wait. This also means the implementation of further memfd seal flags can be done within mm/memfd.c and also have the opportunity to modify VMA flags as necessary early in the mapping logic. Link: https://lkml.kernel.org/r/20241206212846.210835-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jeff Xu <jeffxu@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:51 -08:00
Lorenzo Stoakes	e6a86f6fe9	mm: enforce __must_check on VMA merge and split It is of critical importance to check the return results on VMA merge (and split), failure to do so can result in use-after-free's. This bug has recurred, so have the compiler enforce this check to prevent any future repetition. Link: https://lkml.kernel.org/r/20241206225036.273103-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:51 -08:00
Andrew Morton	7f1e9a24ec	mm-damon-tests-vaddr-kunith-reduce-stack-consumption-fix fix build Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:51 -08:00
Andrew Morton	61e5bbc2b8	mm/damon/tests/vaddr-kunit.h: reduce stack consumption After "mm: move per-vma lock into vm_area_struct" we're hitting mm/damon/tests/vaddr-kunit.h: In function 'damon_test_three_regions_in_vmas': mm/damon/tests/vaddr-kunit.h:92:1: error: the frame size of 3280 bytes is larger than 2048 bytes [-Werror=frame-larger-than=] Fix by moving all those vmas off the stack. Closes: https://lkml.kernel.org/r/20241209170829.11311e70@canb.auug.org.au Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:51 -08:00
Suren Baghdasaryan	19fbad905e	mm: convert mm_lock_seq to a proper seqcount Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock variants to increment it, in-line with the usual seqcount usage pattern. This lets us check whether the mmap_lock is write-locked by checking mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be used when implementing mmap_lock speculation functions. As a result vm_lock_seq is also change to be unsigned to match the type of mm_lock_seq.sequence. Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:50 -08:00
Guo Weikang	53d93d2915	mm/shmem: refactor to reuse vfs_parse_monolithic_sep for option parsing shmem_parse_options() is refactored to use vfs_parse_monolithic_sep() with a custom separator function, shmem_next_opt(). This eliminates redundant logic for parsing comma-separated options and ensures consistency with other kernel code that uses the same interface. The vfs_parse_monolithic_sep() helper was introduced in commit `e001d1447c` ("fs: factor out vfs_parse_monolithic_sep() helper"). Link: https://lkml.kernel.org/r/20241205094521.1244678-1-guoweikang.kernel@gmail.com Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:49 -08:00
Wenchao Hao	9de7704bf2	mm: add per-order mTHP swap-in fallback/fallback_charge counters Currently, large folio swap-in is supported, but we lack a method to analyze their success ratio. Similar to anon_fault_fallback, we introduce per-order mTHP swpin_fallback and swpin_fallback_charge counters for calculating their success ratio. The new counters are located at: /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats/ swpin_fallback swpin_fallback_charge Link: https://lkml.kernel.org/r/20241202124730.2407037-1-haowenchao22@gmail.com Signed-off-by: Wenchao Hao <haowenchao22@gmail.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:49 -08:00
Qi Zheng	f1fdbec3ff	mm: pgtable: make ptlock be freed by RCU If ALLOC_SPLIT_PTLOCKS is enabled, the ptdesc->ptl will be a pointer and a ptlock will be allocated for it, and it will be freed immediately before the PTE page is freed. Once we support empty PTE page reclaimation, it may result in the following use-after-free problem: CPU 0 CPU 1 pte_offset_map_rw_nolock(&ptlock) --> rcu_read_lock() madvise(MADV_DONTNEED) --> ptlock_free (free ptlock immediately!) free PTE page via RCU /* UAF!! */ spin_lock(ptlock) To avoid this problem, make ptlock also be freed by RCU. Link: https://lkml.kernel.org/r/20241210084431.91414-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: syzbot+1c58afed1cfd2f57efee@syzkaller.appspotmail.com Tested-by: syzbot+1c58afed1cfd2f57efee@syzkaller.appspotmail.com Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:48 -08:00
Qi Zheng	62e76fb4ff	x86: mm: free page table pages by RCU instead of semi RCU Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages will be freed by semi RCU, that is: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free In this way, the page table can be lockless traversed by disabling IRQ in paths such as fast GUP. But this is not enough to free the empty PTE page table pages in paths other that munmap and exit_mmap path, because IPI cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}(). In preparation for supporting empty PTE page table pages reclaimation, let single table also be freed by RCU like batch table freeing. Then we can also use pte_offset_map() etc to prevent PTE page from being freed. Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free the page table pages: - The pt_rcu_head is unioned with pt_list and pmd_huge_pte. - For pt_list, it is used to manage the PGD page in x86. Fortunately tlb_remove_table() will not be used for free PGD pages, so it is safe to use pt_rcu_head. - For pmd_huge_pte, it is used for THPs, so it is safe. After applying this patch, if CONFIG_PT_RECLAIM is enabled, the function call of free_pte() is as follows: free_pte pte_free_tlb __pte_free_tlb ___pte_free_tlb paravirt_tlb_remove_table tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM] [no-free-memory slowpath:] tlb_table_invalidate tlb_remove_table_one __tlb_remove_table_one [frees via RCU] [fastpath:] tlb_table_flush tlb_remove_table_free [frees via RCU] native_tlb_remove_table [CONFIG_PARAVIRT on native] tlb_remove_table [see above] Link: https://lkml.kernel.org/r/0287d442a973150b0e1019cc406e6322d148277a.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:48 -08:00
Qi Zheng	9b20e24372	mm-pgtable-reclaim-empty-pte-page-in-madvisemadv_dontneed-fix Dan Carpenter reported the following warning: Commit `e3aafd2d35` ("mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)") from Dec 4, 2024 (linux-next), leads to the following Smatch static checker warning: mm/pt_reclaim.c:69 try_to_free_pte() error: uninitialized symbol 'ptl'. To fix it, assign an initial value of NULL to the ptl. Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/ Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:48 -08:00
Qi Zheng	3b139cbda7	mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) Now in order to pursue high performance, applications mostly use some high-performance user-mode memory allocators, such as jemalloc or tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause huge page table memory usage. The following are a memory usage snapshot of one process which actually happened on our server: VIRT: 55t RES: 590g VmPTE: 110g In this case, most of the page table entries are empty. For such a PTE page where all entries are empty, we can actually free it back to the system for others to use. As a first step, this commit aims to synchronously free the empty PTE pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than madvise(MADV_DONTNEED). Once an empty PTE is detected, we first try to hold the pmd lock within the pte lock. If successful, we clear the pmd entry directly (fast path). Otherwise, we wait until the pte lock is released, then re-hold the pmd and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect whether the PTE page is empty and free it (slow path). For other cases such as madvise(MADV_FREE), consider scanning and freeing empty PTE pages asynchronously in the future. The following code snippet can show the effect of optimization: mmap 50G while (1) { for (; i < 1024 * 25; i++) { touch 2M memory madvise MADV_DONTNEED 2M } } As we can see, the memory usage of VmPTE is reduced: before after VIRT 50.0 GB 50.0 GB RES 3.1 MB 3.1 MB VmPTE 102640 KB 240 KB Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:47 -08:00

1 2 3 4 5 ...

23636 Commits