mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-01-16 18:08:20 +00:00
23492 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Qi Zheng
|
c85507857b |
mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
In __collapse_huge_page_swapin(), we just use the ptl for the pte_same() check in do_swap_page(). In other places, we directly use pte_offset_map_lock(), so convert it to use pte_offset_map_ro_nolock(). Link: https://lkml.kernel.org/r/dc97a6c3cb9ea80cab30c5626eeea79959d93258.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
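The conversion above follows a common pattern worth spelling out. Below is a minimal sketch of the read-only flow, assuming the pte_offset_map_ro_nolock() signature introduced later in this series (mm, pmd, addr plus a spinlock_t ** output); orig_pte stands in for a previously sampled entry and is an assumption of this sketch.
```c
/* Map the PTE table without taking the PTL, then take the lock only to
 * re-validate the entry before acting on it (read-only access afterwards). */
spinlock_t *ptl;
pte_t *pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);

if (!pte)
        return;                         /* the PTE table disappeared */

spin_lock(ptl);
if (pte_same(ptep_get(pte), orig_pte)) {
        /* entry unchanged: safe to read it under the PTL */
}
spin_unlock(ptl);
pte_unmap(pte);
```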
Qi Zheng
|
bd6ad65ddc |
mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()
In filemap_fault_recheck_pte_none(), we just do a pte_none() check, so convert it to use pte_offset_map_ro_nolock(). Link: https://lkml.kernel.org/r/9f7cbbaa772385ced1b8931b67a8b9d246c9b82d.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Qi Zheng
|
66efef9b1a |
mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Patch series "introduce pte_offset_map_{ro|rw}_nolock()", v5. As proposed by David Hildenbrand [1], this series introduces the following two new helper functions to replace pte_offset_map_nolock(). 1. pte_offset_map_ro_nolock() 2. pte_offset_map_rw_nolock() As the name suggests, pte_offset_map_ro_nolock() is used for read-only case. In this case, only read-only operations will be performed on PTE page after the PTL is held. The RCU lock in pte_offset_map_nolock() will ensure that the PTE page will not be freed, and there is no need to worry about whether the pmd entry is modified. Therefore pte_offset_map_ro_nolock() is just a renamed version of pte_offset_map_nolock(). pte_offset_map_rw_nolock() is used for may-write case. In this case, the pte or pmd entry may be modified after the PTL is held, so we need to ensure that the pmd entry has not been modified concurrently. So in addition to the name change, it also outputs the pmdval when successful. The users should make sure the page table is stable like checking pte_same() or checking pmd_same() by using the output pmdval before performing the write operations. This series will convert all pte_offset_map_nolock() into the above two helper functions one by one, and finally completely delete it. This also a preparation for reclaiming the empty user PTE page table pages. This patch (of 13): Currently, the usage of pte_offset_map_nolock() can be divided into the following two cases: 1) After acquiring PTL, only read-only operations are performed on the PTE page. In this case, the RCU lock in pte_offset_map_nolock() will ensure that the PTE page will not be freed, and there is no need to worry about whether the pmd entry is modified. 2) After acquiring PTL, the pte or pmd entries may be modified. At this time, we need to ensure that the pmd entry has not been modified concurrently. To more clearing distinguish between these two cases, this commit introduces two new helper functions to replace pte_offset_map_nolock(). For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition to changing the name to pte_offset_map_rw_nolock(), it also outputs the pmdval when successful. It is applicable for may-write cases where any modification operations to the page table may happen after the corresponding spinlock is held afterwards. But the users should make sure the page table is stable like checking pte_same() or checking pmd_same() by using the output pmdval before performing the write operations. Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will be read-only/read-write protected. Subsequent commits will convert pte_offset_map_nolock() into the above two functions one by one, and finally completely delete it. Link: https://lkml.kernel.org/r/cover.1727332572.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
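For contrast with the read-only pattern shown earlier, here is a hedged sketch of the may-write flow this commit describes, using the pmdval output to re-check stability before modifying entries; mm, pmd and addr are placeholders for the caller's context.
```c
spinlock_t *ptl;
pmd_t pmdval;
pte_t *pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);

if (!pte)
        return;

spin_lock(ptl);
/* The pmd entry may have changed between mapping and locking: verify it. */
if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
        spin_unlock(ptl);
        pte_unmap(pte);
        return;                         /* bail out or retry */
}
/* Page table is stable: write operations on the PTEs are allowed here. */
spin_unlock(ptl);
pte_unmap(pte);
```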
Nanyong Sun
|
f2f484085e |
mm: move mm flags to mm_types.h
The mm flags are now used far beyond the core dump related features. This patch moves the mm flags from linux/sched/coredump.h to linux/mm_types.h. Since linux/sched/coredump.h already includes mm_types.h, the C files related to coredump do not need to change their header file inclusion. In addition, the inclusion of sched/coredump.h can now be deleted from the C files that are irrelevant to core dump. Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
021781b012 |
mm/madvise: unrestrict process_madvise() for current process
The process_madvise() call was introduced in commit ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API") as a means of performing madvise() operations on another process. However, as it provides the means by which to perform multiple madvise() operations in a batch via an iovec, it is useful to utilise the same interface for performing operations on the current process rather than a remote one. Commit 22af8caff7d1 ("mm/madvise: process_madvise() drop capability check if same mm") removed the need for a caller invoking process_madvise() on its own pidfd to possess the CAP_SYS_NICE capability, however this leaves the restrictions on operation in place. Resolve this by only applying the restriction on operations when accessing a remote process. Moving forward we plan to implement a simpler means of specifying this condition other than needing to establish a self pidfd, perhaps in the form of a sentinel pidfd. Also take the opportunity to refactor the system call implementation abstracting the vectorised operation. Link: https://lkml.kernel.org/r/20240926151019.82902-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <brauner@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
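A self-contained userspace sketch of what the relaxed restriction above enables: batch-advising a range of the calling process itself through process_madvise() on a self pidfd. Raw syscall() is used because libc wrappers may be absent, and the MADV_COLD fallback define is an assumption for older headers; treat this as an illustration rather than a reference usage.
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>

#ifndef MADV_COLD
#define MADV_COLD 20                    /* assumed value for older headers */
#endif

int main(void)
{
        size_t len = 16 * 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0xaa, len);         /* fault the pages in */

        int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        /* Advising our own memory: no CAP_SYS_NICE or remote-process checks. */
        ssize_t ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLD, 0);
        printf("process_madvise returned %zd\n", ret);
        return ret < 0;
}
```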
Tanya Agarwal
|
1cd1a4e71b |
mm/mempolicy: fix comments for better documentation
Fix a typo in mempolicy.h and correct the number of allowed memory policies. Link: https://lkml.kernel.org/r/20240926183516.4034-2-tanyaagarwal25699@gmail.com Signed-off-by: Tanya Agarwal <tanyaagarwal25699@gmail.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Cc: Anup Sharma <anupnewsmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zhiguo Jiang
|
bbc251f30e |
mm: fix shrink nr.unqueued_dirty counter issue
Check that sc->nr.unqueued_dirty > 0 before setting the PGDAT_DIRTY flag, which avoids setting it when sc->nr.unqueued_dirty and sc->nr.file_taken are both zero. Link: https://lkml.kernel.org/r/20240112012353.1387-1-justinjiang@vivo.com Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
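A hedged sketch of the guarded check the fix above describes, written against the vmscan-style names used in the message (sc and pgdat are assumed to be the scan control and node in scope):
```c
/* Only flag the node as dirty-congested when there really are unqueued
 * dirty pages; with both counters at zero the equality alone would
 * spuriously set PGDAT_DIRTY. */
if (sc->nr.unqueued_dirty &&
    sc->nr.unqueued_dirty == sc->nr.file_taken)
        set_bit(PGDAT_DIRTY, &pgdat->flags);
```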
Lorenzo Stoakes
|
cd3f8467af |
mm: refactor mm_access() to not return NULL
mm_access() can return NULL if the mm is not found, but this is handled the same as an error in all callers, with some translating this into an -ESRCH error. Only proc_mem_open() returns NULL if no mm is found, however in this case it is clearer and makes more sense to explicitly handle the error. Additionally we take the opportunity to refactor the function to eliminate unnecessary nesting. Simplify things by simply returning -ESRCH if no mm is found - this both eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies callers which would return -ESRCH by returning this error directly. [lorenzo.stoakes@oracle.com: prefer neater pointer error comparison] Link: https://lkml.kernel.org/r/2fae1834-749a-45e1-8594-5e5979cf7103@lucifer.local Link: https://lkml.kernel.org/r/20240924201023.193135-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
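A minimal sketch of the simplified caller pattern after the change above, assuming a task_struct *task in scope: mm_access() now only returns a valid mm or an ERR_PTR, so no NULL check is needed.
```c
struct mm_struct *mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);

if (IS_ERR(mm))
        return PTR_ERR(mm);     /* -ESRCH when the task has no mm */

/* ... operate on mm ... */
mmput(mm);
```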
Adrian Huang
|
9e9e085eff |
mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation
When compiling kernel source 'make -j $(nproc)' with the up-and-running KASAN-enabled kernel on a 256-core machine, the following soft lockup is shown: watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760] CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95 Workqueue: events drain_vmap_area_work RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0 Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75 RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202 RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949 RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50 RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800 R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39 R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0 Call Trace: <IRQ> ? watchdog_timer_fn+0x2cd/0x390 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0x300/0x6d0 ? sched_clock_cpu+0x69/0x4e0 ? __pfx___hrtimer_run_queues+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? ktime_get_update_offsets_now+0x7f/0x2a0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? hrtimer_interrupt+0x2ca/0x760 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0 ? sysvec_apic_timer_interrupt+0x6a/0x90 </IRQ> <TASK> ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? smp_call_function_many_cond+0x1d8/0xbb0 ? __pfx_do_kernel_range_flush+0x10/0x10 on_each_cpu_cond_mask+0x20/0x40 flush_tlb_kernel_range+0x19b/0x250 ? srso_return_thunk+0x5/0x5f ? kasan_release_vmalloc+0xa7/0xc0 purge_vmap_node+0x357/0x820 ? __pfx_purge_vmap_node+0x10/0x10 __purge_vmap_area_lazy+0x5b8/0xa10 drain_vmap_area_work+0x21/0x30 process_one_work+0x661/0x10b0 worker_thread+0x844/0x10e0 ? srso_return_thunk+0x5/0x5f ? __kthread_parkme+0x82/0x140 ? __pfx_worker_thread+0x10/0x10 kthread+0x2a5/0x370 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Debugging Analysis: 1. The following ftrace log shows that the lockup CPU spends too much time iterating vmap_nodes and flushing TLB when purging vm_area structures. (Some info is trimmed). kworker: funcgraph_entry: | drain_vmap_area_work() { kworker: funcgraph_entry: | mutex_lock() { kworker: funcgraph_entry: 1.092 us | __cond_resched(); kworker: funcgraph_exit: 3.306 us | } ... ... kworker: funcgraph_entry: | flush_tlb_kernel_range() { ... ... kworker: funcgraph_exit: # 7533.649 us | } ... ... kworker: funcgraph_entry: 2.344 us | mutex_unlock(); kworker: funcgraph_exit: $ 23871554 us | } The drain_vmap_area_work() spends over 23 seconds. There are 2805 flush_tlb_kernel_range() calls in the ftrace log. * One is called in __purge_vmap_area_lazy(). * Others are called by purge_vmap_node->kasan_release_vmalloc. purge_vmap_node() iteratively releases kasan vmalloc allocations and flushes TLB for each vmap_area. - [Rough calculation] Each flush_tlb_kernel_range() runs about 7.5ms. -- 2804 * 7.5ms = 21.03 seconds. -- That's why a soft lock is triggered. 2. Extending the soft lockup time can work around the issue (For example, # echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the above-mentioned speculation: drain_vmap_area_work() spends too much time. 
If we combine all TLB flush operations of the KASAN shadow virtual address into one operation in the call path 'purge_vmap_node()->kasan_release_vmalloc()', the running time of drain_vmap_area_work() can be saved greatly. The idea is from the flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the soft lockup won't be triggered. Here is the test result based on 6.10: [6.10 wo/ the patch] 1. ftrace latency profiling (record a trace if the latency > 20s). echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function echo function_graph > /sys/kernel/debug/tracing/current_tracer echo 1 > /sys/kernel/debug/tracing/tracing_on 2. Run `make -j $(nproc)` to compile the kernel source 3. Once the soft lockup is reproduced, check the ftrace log: cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 76) $ 50412985 us | } /* __purge_vmap_area_lazy */ 76) $ 50412997 us | } /* drain_vmap_area_work */ 76) $ 29165911 us | } /* __purge_vmap_area_lazy */ 76) $ 29165926 us | } /* drain_vmap_area_work */ 91) $ 53629423 us | } /* __purge_vmap_area_lazy */ 91) $ 53629434 us | } /* drain_vmap_area_work */ 91) $ 28121014 us | } /* __purge_vmap_area_lazy */ 91) $ 28121026 us | } /* drain_vmap_area_work */ [6.10 w/ the patch] 1. Repeat step 1-2 in "[6.10 wo/ the patch]" 2. The soft lockup is not triggered and ftrace log is empty. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace log. 4. Setting 'tracing_thresh' to 1 second gets ftrace log. cat /sys/kernel/debug/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 23) $ 1074942 us | } /* __purge_vmap_area_lazy */ 23) $ 1074950 us | } /* drain_vmap_area_work */ The worst execution time of drain_vmap_area_work() is about 1 second. Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/ Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
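A rough sketch of the batching idea from the commit above, with assumed local names (purge_list, va): track a single range covering every released KASAN shadow region, then flush once rather than once per vmap_area.
```c
unsigned long start = ULONG_MAX, end = 0;
struct vmap_area *va;

list_for_each_entry(va, &purge_list, list) {
        start = min(start, (unsigned long)kasan_mem_to_shadow((void *)va->va_start));
        end   = max(end,   (unsigned long)kasan_mem_to_shadow((void *)va->va_end));
        /* ... release the shadow mapping for this area ... */
}
if (start < end)
        flush_tlb_kernel_range(start, end);     /* one flush for the whole batch */
```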
Jingxiang Zeng
|
15ff4d409e |
mm/memcontrol: add per-memcg pgpgin/pswpin counter
In proactive memory reclamation scenarios, it is necessary to estimate the pswpin and pswpout metrics of the cgroup to determine whether to continue reclaiming anonymous pages in the current batch. This patch will collect these metrics and expose them. [linuszeng@tencent.com: v2] Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com Link: https://lkml.kernel.org/r/20240913084453.3605621-1-jingxiangzeng.cas@gmail.com Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
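A brief sketch of the accounting the commit above adds, assuming PSWPIN/PSWPOUT are included in the per-memcg event set as described; folio is the one being swapped in.
```c
/* Attribute swap-in events to the folio's memcg, one per base page. */
count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
```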
Leo Stone
|
ba7196e566 |
mm/damon: fix sparse warning for zero initializer
sparse warns about zero initializing an array with {0,}, change it to the equivalent {0}. Fixes the sparse warning: mm/damon/tests/vaddr-kunit.h:69:47: warning: missing braces around initializer Link: https://lkml.kernel.org/r/xriwklcwjpwcz7eiavo6f7envdar4jychhsk6sfkj5klaznb6b@j6vrvr2sxjht Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: Leo Stone <leocstone@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Baolin Wang
|
d2d243df44 |
mm: shmem: fix khugepaged activation policy for shmem
Shmem has a separate interface (different from anonymous pages) to control huge page allocation, that means shmem THP can be enabled while anonymous THP is disabled. However, in this case, khugepaged will not start to collapse shmem THP, which is unreasonable. To fix this issue, we should call start_stop_khugepaged() to activate or deactivate the khugepaged thread when setting shmem mTHP interfaces. Moreover, add a new helper shmem_hpage_pmd_enabled() to help to check whether shmem THP is enabled, which will determine if khugepaged should be activated. Link: https://lkml.kernel.org/r/9b9c6cbc4499bf44c6455367fd9e0f6036525680.1726978977.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
5de195060b |
mm: resolve faulty mmap_region() error path behaviour
The mmap_region() function is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. Taking advantage of previous patches in this series we move a number of checks earlier in the code, simplifying things by moving the core of the logic into a static internal function __mmap_region(). Doing this allows us to perform a number of checks up front before we do any real work, and allows us to unwind the writable unmap check unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE validation unconditionally also. We move a number of things here: 1. We preallocate memory for the iterator before we call the file-backed memory hook, allowing us to exit early and avoid having to perform complicated and error-prone close/free logic. We carefully free iterator state on both success and error paths. 2. The enclosing mmap_region() function handles the mapping_map_writable() logic early. Previously the logic had the mapping_map_writable() at the point of mapping a newly allocated file-backed VMA, and a matching mapping_unmap_writable() on success and error paths. We now do this unconditionally if this is a file-backed, shared writable mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however doing so does not invalidate the seal check we just performed, and we in any case always decrement the counter in the wrapper. We perform a debug assert to ensure a driver does not attempt to do the opposite. 3. We also move arch_validate_flags() up into the mmap_region() function. This is only relevant on arm64 and sparc64, and the check is only meaningful for SPARC with ADI enabled. We explicitly add a warning for this arch if a driver invalidates this check, though the code ought eventually to be fixed to eliminate the need for this. With all of these measures in place, we no longer need to explicitly close the VMA on error paths, as we place all checks which might fail prior to a call to any driver mmap hook. This eliminates an entire class of errors, makes the code easier to reason about and more robust. Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Mark Brown <broonie@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
5baf8b037d |
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
Currently MTE is permitted in two circumstances (desiring to use MTE having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is specified, as checked by arch_calc_vm_flag_bits() and actualised by setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap hook is activated in mmap_region(). The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also set is the arm64 implementation of arch_validate_flags(). Unfortunately, we intend to refactor mmap_region() to perform this check earlier, meaning that in the case of a shmem backing we will not have invoked shmem_mmap() yet, causing the mapping to fail spuriously. It is inappropriate to set this architecture-specific flag in general mm code anyway, so a sensible resolution of this issue is to instead move the check somewhere else. We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via the arch_calc_vm_flag_bits() call. This is an appropriate place to do this as we already check for the MAP_ANONYMOUS case here, and the shmem file case is simply a variant of the same idea - we permit RAM-backed memory. This requires a modification to the arch_calc_vm_flag_bits() signature to pass in a pointer to the struct file associated with the mapping, however this is not too egregious as this is only used by two architectures anyway - arm64 and parisc. So this patch performs this adjustment and removes the unnecessary assignment of VM_MTE_ALLOWED in shmem_mmap(). [akpm@linux-foundation.org: fix whitespace, per Catalin] Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andreas Larsson <andreas@gaisler.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
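A hedged sketch of the arm64 direction described above (not the exact upstream code): decide MTE eligibility directly from the mapping type now that the backing file is available. The shmem_file() helper is assumed here to test for a shmem-backed file, and hugetlb handling is omitted for brevity.
```c
static inline unsigned long arch_calc_vm_flag_bits(struct file *file,
						   unsigned long flags)
{
	/* Permit MTE for RAM-backed mappings: anonymous or shmem-backed. */
	if (system_supports_mte() &&
	    ((flags & MAP_ANONYMOUS) || shmem_file(file)))
		return VM_MTE_ALLOWED;
	return 0;
}
```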
Lorenzo Stoakes
|
0fb4a7ad27 |
mm: refactor map_deny_write_exec()
Refactor the map_deny_write_exec() to not unnecessarily require a VMA parameter but rather to accept VMA flags parameters, which allows us to use this function early in mmap_region() in a subsequent commit. While we're here, we refactor the function to be more readable and add some additional documentation. Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
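A hedged sketch of the refactored helper described above, operating purely on flag values (old being the pre-existing VMA's flags, new the flags being requested), which is what allows it to run before any VMA exists:
```c
static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
{
	/* MDWE not enabled for this mm: nothing to deny. */
	if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
		return false;

	/* Only executable mappings are of interest. */
	if (!(new & VM_EXEC))
		return false;

	/* Deny newly writable-and-executable mappings... */
	if (new & VM_WRITE)
		return true;

	/* ...and deny making a previously non-executable mapping executable. */
	if (!(old & VM_EXEC))
		return true;

	return false;
}
```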
Lorenzo Stoakes
|
4080ef1579 |
mm: unconditionally close VMAs on error
Incorrect invocation of VMA callbacks when the VMA is no longer in a consistent state is bug prone and risky to perform. With regards to the important vm_ops->close() callback We have gone to great lengths to try to track whether or not we ought to close VMAs. Rather than doing so and risking making a mistake somewhere, instead unconditionally close and reset vma->vm_ops to an empty dummy operations set with a NULL .close operator. We introduce a new function to do so - vma_close() - and simplify existing vms logic which tracked whether we needed to close or not. This simplifies the logic, avoids incorrect double-calling of the .close() callback and allows us to update error paths to simply call vma_close() unconditionally - making VMA closure idempotent. Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
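A minimal sketch of the vma_close() idea from the commit above: call .close at most once and then install empty operations so any later (error-path) call is a no-op; vma_dummy_vm_ops stands for the assumed static empty ops set with a NULL .close.
```c
static void vma_close(struct vm_area_struct *vma)
{
	if (vma->vm_ops && vma->vm_ops->close) {
		vma->vm_ops->close(vma);
		/* Make further closes (and other hooks) harmless. */
		vma->vm_ops = &vma_dummy_vm_ops;
	}
}
```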
Lorenzo Stoakes
|
3dd6ed34ce |
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
Patch series "fix error handling in mmap_region() and refactor (hotfixes)", v4. mmap_region() is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. This series goes to great lengths to simplify how mmap_region() works and to avoid unwinding errors late on in the process of setting up the VMA for the new mapping, and equally avoids such operations occurring while the VMA is in an inconsistent state. The patches in this series comprise the minimal changes required to resolve existing issues in mmap_region() error handling, in order that they can be hotfixed and backported. There is additionally a follow up series which goes further, separated out from the v1 series and sent and updated separately. This patch (of 5): After an attempted mmap() fails, we are no longer in a situation where we can safely interact with VMA hooks. This is currently not enforced, meaning that we need complicated handling to ensure we do not incorrectly call these hooks. We can avoid the whole issue by treating the VMA as suspect the moment that the file->f_ops->mmap() function reports an error by replacing whatever VMA operations were installed with a dummy empty set of VMA operations. We do so through a new helper function internal to mm - mmap_file() - which is both more logically named than the existing call_mmap() function and correctly isolates handling of the vm_op reassignment to mm. All the existing invocations of call_mmap() outside of mm are ultimately nested within the call_mmap() from mm, which we now replace. It is therefore safe to leave call_mmap() in place as a convenience function (and to avoid churn). The invokers are: ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap() coda_file_operations -> mmap -> coda_file_mmap() shm_file_operations -> shm_mmap() shm_file_operations_huge -> shm_mmap() dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops -> i915_gem_dmabuf_mmap() None of these callers interact with vm_ops or mappings in a problematic way on error, quickly exiting out. Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
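A sketch of the mmap_file() wrapper described above, assuming the same dummy ops set: if the driver's ->mmap() fails, strip whatever vm_ops it may have installed so no further hooks run on a half-initialised VMA.
```c
static int mmap_file(struct file *file, struct vm_area_struct *vma)
{
	int err = call_mmap(file, vma);

	if (likely(!err))
		return 0;

	/* The VMA is now suspect: prevent any vm_ops callback from running. */
	vma->vm_ops = &vma_dummy_vm_ops;
	return err;
}
```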
Hugh Dickins
|
f8f931bba0 |
mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. 
As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases. Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
e66f3185fa |
mm/thp: fix deferred split queue not partially_mapped
Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. The new unlocked list_del_init() in deferred_split_scan() is buggy. I gave bad advice, it looks plausible since that's a local on-stack list, but the fact is that it can race with a third party freeing or migrating the preceding folio (properly unqueueing it with refcount 0 while holding split_queue_lock), thereby corrupting the list linkage. The obvious answer would be to take split_queue_lock there: but it has a long history of contention, so I'm reluctant to add to that. Instead, make sure that there is always one safe (raised refcount) folio before, by delaying its folio_put(). (And of course I was wrong to suggest updating split_queue_len without the lock: leave that until the splice.) And remove two over-eager partially_mapped checks, restoring those tests to how they were before: if uncharge_folio() or free_tail_page_prepare() finds _deferred_list non-empty, it's in trouble whether or not that folio is partially_mapped (and the flag was already cleared in the latter case). Link: https://lkml.kernel.org/r/81e34a8b-113a-0701-740e-2135c97eb1d7@google.com Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Koichiro Den
|
9c9201afeb |
mm/slab: fix warning caused by duplicate kmem_cache creation in kmem_buckets_create
Commit b035f5a6d852 ("mm: slab: reduce the kmalloc() minimum alignment if DMA bouncing possible") reduced ARCH_KMALLOC_MINALIGN to 8 on arm64. However, with KASAN_HW_TAGS enabled, arch_slab_minalign() becomes 16. This causes kmalloc_caches[*][8] to be aliased to kmalloc_caches[*][16], resulting in kmem_buckets_create() attempting to create a kmem_cache for size 16 twice. This duplication triggers warnings on boot: [ 2.325108] ------------[ cut here ]------------ [ 2.325135] kmem_cache of name 'memdup_user-16' already exists [ 2.325783] WARNING: CPU: 0 PID: 1 at mm/slab_common.c:107 __kmem_cache_create_args+0xb8/0x3b0 [ 2.327957] Modules linked in: [ 2.328550] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc5mm-unstable-arm64+ #12 [ 2.328683] Hardware name: QEMU QEMU Virtual Machine, BIOS 2024.02-2 03/11/2024 [ 2.328790] pstate: 61000009 (nZCv daif -PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 2.328911] pc : __kmem_cache_create_args+0xb8/0x3b0 [ 2.328930] lr : __kmem_cache_create_args+0xb8/0x3b0 [ 2.328942] sp : ffff800083d6fc50 [ 2.328961] x29: ffff800083d6fc50 x28: f2ff0000c1674410 x27: ffff8000820b0598 [ 2.329061] x26: 000000007fffffff x25: 0000000000000010 x24: 0000000000002000 [ 2.329101] x23: ffff800083d6fce8 x22: ffff8000832222e8 x21: ffff800083222388 [ 2.329118] x20: f2ff0000c1674410 x19: f5ff0000c16364c0 x18: ffff800083d80030 [ 2.329135] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 2.329152] x14: 0000000000000000 x13: 0a73747369786520 x12: 79646165726c6120 [ 2.329169] x11: 656820747563205b x10: 2d2d2d2d2d2d2d2d x9 : 0000000000000000 [ 2.329194] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 2.329210] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 2.329226] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 2.329291] Call trace: [ 2.329407] __kmem_cache_create_args+0xb8/0x3b0 [ 2.329499] kmem_buckets_create+0xfc/0x320 [ 2.329526] init_user_buckets+0x34/0x78 [ 2.329540] do_one_initcall+0x64/0x3c8 [ 2.329550] kernel_init_freeable+0x26c/0x578 [ 2.329562] kernel_init+0x3c/0x258 [ 2.329574] ret_from_fork+0x10/0x20 [ 2.329698] ---[ end trace 0000000000000000 ]--- [ 2.403704] ------------[ cut here ]------------ [ 2.404716] kmem_cache of name 'msg_msg-16' already exists [ 2.404801] WARNING: CPU: 2 PID: 1 at mm/slab_common.c:107 __kmem_cache_create_args+0xb8/0x3b0 [ 2.404842] Modules linked in: [ 2.404971] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.12.0-rc5mm-unstable-arm64+ #12 [ 2.405026] Tainted: [W]=WARN [ 2.405043] Hardware name: QEMU QEMU Virtual Machine, BIOS 2024.02-2 03/11/2024 [ 2.405057] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2.405079] pc : __kmem_cache_create_args+0xb8/0x3b0 [ 2.405100] lr : __kmem_cache_create_args+0xb8/0x3b0 [ 2.405111] sp : ffff800083d6fc50 [ 2.405115] x29: ffff800083d6fc50 x28: fbff0000c1674410 x27: ffff8000820b0598 [ 2.405135] x26: 000000000000ffd0 x25: 0000000000000010 x24: 0000000000006000 [ 2.405153] x23: ffff800083d6fce8 x22: ffff8000832222e8 x21: ffff800083222388 [ 2.405169] x20: fbff0000c1674410 x19: fdff0000c163d6c0 x18: ffff800083d80030 [ 2.405185] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 2.405201] x14: 0000000000000000 x13: 0a73747369786520 x12: 79646165726c6120 [ 2.405217] x11: 656820747563205b x10: 2d2d2d2d2d2d2d2d x9 : 0000000000000000 [ 2.405233] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 2.405248] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 2.405271] 
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 2.405287] Call trace: [ 2.405293] __kmem_cache_create_args+0xb8/0x3b0 [ 2.405305] kmem_buckets_create+0xfc/0x320 [ 2.405315] init_msg_buckets+0x34/0x78 [ 2.405326] do_one_initcall+0x64/0x3c8 [ 2.405337] kernel_init_freeable+0x26c/0x578 [ 2.405348] kernel_init+0x3c/0x258 [ 2.405360] ret_from_fork+0x10/0x20 [ 2.405370] ---[ end trace 0000000000000000 ]--- To address this, alias kmem_cache for sizes smaller than min alignment to the aligned sized kmem_cache, as done with the default system kmalloc bucket. Fixes: b32801d1255b ("mm/slab: Introduce kmem_buckets_create() and family") Cc: <stable@vger.kernel.org> # v6.11+ Signed-off-by: Koichiro Den <koichiro.den@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
Joanne Koong
|
7fce207af5 |
mm/writeback: add folio_mark_dirty_lock()
Add a new convenience helper folio_mark_dirty_lock() that grabs the folio lock before calling folio_mark_dirty(). Refactor set_page_dirty_lock() to directly use folio_mark_dirty_lock(). Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
||
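A minimal sketch of the helper as described above (the real definition may differ in placement and annotations):
```c
bool folio_mark_dirty_lock(struct folio *folio)
{
	bool ret;

	folio_lock(folio);
	ret = folio_mark_dirty(folio);
	folio_unlock(folio);
	return ret;
}
```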
Linus Torvalds
|
a8cc743272 |
17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM.
The usual collection of singletons - please see the changelogs. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZyfGDAAKCRDdBJ7gKXxA jr19AQD6bfDF/6L2Alq1QG26pgrgccEbKzDSzR6pBajwCbdrNQD/XPhiv3zRJfGf lgt0Qkqwe/ApBhVYUnL8y1CePv3EDgA= =W5W0 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM. The usual collection of singletons - please see the changelogs" * tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify() mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes mm: shrinker: avoid memleak in alloc_shrinker_info .mailmap: update e-mail address for Eugen Hristev vmscan,migrate: fix page count imbalance on node stats when demoting pages mailmap: update Jarkko's email addresses mm: allow set/clear page_type again nilfs2: fix potential deadlock with newly created symlinks Squashfs: fix variable overflow in squashfs_readpage_block kasan: remove vmalloc_percpu test tools/mm: -Werror fixes in page-types/slabinfo mm, swap: avoid over reclaim of full clusters mm: fix PSWPIN counter for large folios swap-in mm: avoid VM_BUG_ON when try to map an anon large folio to zero page. mm/codetag: fix null pointer check logic for ref and tag mm/gup: stop leaking pinned pages in low memory conditions |
||
Yu Zhao
|
1d4832becd |
mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
When the MM_WALK capability is enabled, memory that is mostly accessed by a VM appears younger than it really is, therefore this memory will be less likely to be evicted. Therefore, the presence of a running VM can significantly increase swap-outs for non-VM memory, regressing the performance for the rest of the system. Fix this regression by always calling {ptep,pmdp}_clear_young_notify() whenever we clear the young bits on PMDs/PTEs. [jthoughton@google.com: fix link-time error] Link: https://lkml.kernel.org/r/20241019012940.3656292-3-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reported-by: David Stevens <stevensd@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
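The gist of the fix above as a hedged one-line sketch (vma, addr and pte are the walk's current VMA, address and PTE pointer): age through the MMU-notifier-aware helper so secondary MMUs such as KVM clear their accessed bits as well.
```c
/* Instead of a bare ptep_test_and_clear_young(), notify secondary MMUs too. */
bool young = ptep_clear_young_notify(vma, addr, pte);
```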
Yu Zhao
|
ddd6d8e975 |
mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
Patch series "mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK". Today, the MM_WALK capability causes MGLRU to clear the young bit from PMDs and PTEs during the page table walk before eviction, but MGLRU does not call the clear_young() MMU notifier in this case. By not calling this notifier, the MM walk takes less time/CPU, but it causes pages that are accessed mostly through KVM / secondary MMUs to appear younger than they should be. We do call the clear_young() notifier today, but only when attempting to evict the page, so we end up clearing young/accessed information less frequently for secondary MMUs than for mm PTEs, and therefore they appear younger and are less likely to be evicted. Therefore, memory that is *not* being accessed mostly by KVM will be evicted *more* frequently, worsening performance. ChromeOS observed a tab-open latency regression when enabling MGLRU with a setup that involved running a VM: Tab-open latency histogram (ms) Version p50 mean p95 p99 max base 1315 1198 2347 3454 10319 mglru 2559 1311 7399 12060 43758 fix 1119 926 2470 4211 6947 This series replaces the final non-selftest patchs from this series[1], which introduced a similar change (and a new MMU notifier) with KVM optimizations. I'll send a separate series (to Sean and Paolo) for the KVM optimizations. This series also makes proactive reclaim with MGLRU possible for KVM memory. I have verified that this functions correctly with the selftest from [1], but given that that test is a KVM selftest, I'll send it with the rest of the KVM optimizations later. Andrew, let me know if you'd like to take the test now anyway. [1]: https://lore.kernel.org/linux-mm/20240926013506.860253-18-jthoughton@google.com/ This patch (of 2): The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very helpful and become more complicated to properly compute when adding test/clear_young() notifiers in MGLRU's mm walk. Link: https://lkml.kernel.org/r/20241019012940.3656292-1-jthoughton@google.com Link: https://lkml.kernel.org/r/20241019012940.3656292-2-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: David Stevens <stevensd@google.com> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Al Viro
|
7133dd5ac6 |
memcg_write_event_control(): switch to CLASS(fd)
Some reordering is required - move both fdget() calls to the point before the allocations, with a matching move of fdput() to the very end of the failure exit(s); after that it converts trivially. Simplify the cleanups that involve css_put(), while we are at it... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
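A sketch of the scope-based pattern this conversion relies on, assuming the fd_empty()/fd_file() accessors from the same conversion series: fdput() runs automatically when f leaves scope, so every early return is already clean.
```c
static int example_with_fd(int cfd)
{
	CLASS(fd, f)(cfd);              /* fdget(cfd), auto-fdput at scope exit */

	if (fd_empty(f))
		return -EBADF;

	/* ... work with fd_file(f); no explicit fdput() on any path ... */
	return 0;
}
```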
Al Viro
|
65c8941e7d |
convert cachestat(2)
fdput() can be transposed with copy_to_user() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
Al Viro
|
6348be02ee |
fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
Vlastimil Babka
|
d4148aeab4 |
mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") a mmap() of anonymous memory without a specific address hint and of at least PMD_SIZE will be aligned to PMD so that it can benefit from a THP backing page. However this change has been shown to regress some workloads significantly. [1] reports regressions in various spec benchmarks, with up to 600% slowdown of the cactusBSSN benchmark on some platforms. The benchmark seems to create many mappings of 4632kB, which would have merged to a large THP-backed area before commit efa7df3e3bb5 and now they are fragmented to multiple areas each aligned to PMD boundary with gaps between. The regression then seems to be caused mainly due to the benchmark's memory access pattern suffering from TLB or cache aliasing due to the aligned boundaries of the individual areas. Another known regression bisected to commit efa7df3e3bb5 is darktable [2] [3] and early testing suggests this patch fixes the regression there as well. To fix the regression but still try to benefit from THP-friendly anonymous mapping alignment, add a condition that the size of the mapping must be a multiple of PMD size instead of at least PMD size. In case of many odd-sized mapping like the cactusBSSN creates, those will stop being aligned and with gaps between, and instead naturally merge again. Link: https://lkml.kernel.org/r/20241024151228.101841-2-vbabka@suse.cz Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Michael Matz <matz@suse.de> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1] Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2] Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3] Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Cc: Rik van Riel <riel@surriel.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Petr Tesarik <ptesarik@suse.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
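An illustrative before/after of the condition change described above (not the exact call site; use_pmd_alignment is a stand-in for the decision being made):
```c
/* before: any anonymous request of at least PMD_SIZE was PMD-aligned */
if (len >= PMD_SIZE)
	use_pmd_alignment = true;

/* after this commit: only exact multiples of PMD_SIZE are PMD-aligned,
 * so odd-sized mappings keep merging naturally with their neighbours */
if (IS_ALIGNED(len, PMD_SIZE))
	use_pmd_alignment = true;
```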
Chen Ridong
|
15e8156713 |
mm: shrinker: avoid memleak in alloc_shrinker_info
A memleak was found as below: unreferenced object 0xffff8881010d2a80 (size 32): comm "mkdir", pid 1559, jiffies 4294932666 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 @............... backtrace (crc 2e7ef6fa): [<ffffffff81372754>] __kmalloc_node_noprof+0x394/0x470 [<ffffffff813024ab>] alloc_shrinker_info+0x7b/0x1a0 [<ffffffff813b526a>] mem_cgroup_css_online+0x11a/0x3b0 [<ffffffff81198dd9>] online_css+0x29/0xa0 [<ffffffff811a243d>] cgroup_apply_control_enable+0x20d/0x360 [<ffffffff811a5728>] cgroup_mkdir+0x168/0x5f0 [<ffffffff8148543e>] kernfs_iop_mkdir+0x5e/0x90 [<ffffffff813dbb24>] vfs_mkdir+0x144/0x220 [<ffffffff813e1c97>] do_mkdirat+0x87/0x130 [<ffffffff813e1de9>] __x64_sys_mkdir+0x49/0x70 [<ffffffff81f8c928>] do_syscall_64+0x68/0x140 [<ffffffff8200012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e alloc_shrinker_info(), when shrinker_unit_alloc() returns an errer, the info won't be freed. Just fix it. Link: https://lkml.kernel.org/r/20241025060942.1049263-1-chenridong@huaweicloud.com Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Wang Weiyang <wangweiyang2@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Gregory Price
|
35e41024c4 |
vmscan,migrate: fix page count imbalance on node stats when demoting pages
When numa balancing is enabled with demotion, vmscan will call migrate_pages when shrinking LRUs. migrate_pages will decrement the node's isolated page count, leading to an imbalanced count when invoked from (MG)LRU code. The result is dmesg output such as: $ cat /proc/sys/vm/stat_refresh [77383.088417] vmstat_refresh: nr_isolated_anon -103212 [77383.088417] vmstat_refresh: nr_isolated_file -899642 This negative value may impact compaction and reclaim throttling. The following path produces the decrement: shrink_folio_list demote_folio_list migrate_pages migrate_pages_batch migrate_folio_move migrate_folio_done mod_node_page_state(-ve) <- decrement This path happens for SUCCESSFUL migrations, not failures. Typically callers to migrate_pages are required to handle putback/accounting for failures, but this is already handled in the shrink code. When accounting for migrations, instead do not decrement the count when the migration reason is MR_DEMOTION. As of v6.11, this demotion logic is the only source of MR_DEMOTION. Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim") Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
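A hedged sketch of the accounting change in the migration-done path described above (src being the source folio): the isolated-page decrement is skipped for demotions, whose accounting the reclaim caller already handles.
```c
if (reason != MR_DEMOTION)
	mod_node_page_state(folio_pgdat(src),
			    NR_ISOLATED_ANON + folio_is_file_lru(src),
			    -folio_nr_pages(src));
```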
Andrey Konovalov
|
330d8df81f |
kasan: remove vmalloc_percpu test
Commit 1a2473f0cbc0 ("kasan: improve vmalloc tests") added the vmalloc_percpu KASAN test with the assumption that __alloc_percpu always uses vmalloc internally, which is tagged by KASAN. However, __alloc_percpu might allocate memory from the first per-CPU chunk, which is not allocated via vmalloc(). As a result, the test might fail. Remove the test until proper KASAN annotations for per-CPU allocations are added; tracked in https://bugzilla.kernel.org/show_bug.cgi?id=215019. Link: https://lkml.kernel.org/r/20241022160706.38943-1-andrey.konovalov@linux.dev Fixes: 1a2473f0cbc0 ("kasan: improve vmalloc tests") Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reported-by: Samuel Holland <samuel.holland@sifive.com> Link: https://lore.kernel.org/all/4a245fff-cc46-44d1-a5f9-fd2f1c3764ae@sifive.com/ Reported-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Link: https://lore.kernel.org/all/CACzwLxiWzNqPBp4C1VkaXZ2wDwvY3yZeetCi1TLGFipKW77drA@mail.gmail.com/ Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kairui Song
|
5168a68eb7 |
mm, swap: avoid over reclaim of full clusters
When running low on usable slots, the cluster allocator will try to reclaim the full clusters aggressively to reclaim HAS_CACHE slots. This guarantees that as long as there are any usable slots, HAS_CACHE or not, the swap device will be usable and the workload won't go OOM early.

Before the cluster allocator, the swap allocator failed easily if the device was filled up with reclaimable HAS_CACHE slots, which can be easily reproduced with the following simple program:

#include <stdio.h>
#include <string.h>
#include <linux/mman.h>
#include <sys/mman.h>

#define SIZE 8192UL * 1024UL * 1024UL

int main(int argc, char **argv) {
        long tmp;
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(p, 0, SIZE);
        madvise(p, SIZE, MADV_PAGEOUT);
        for (unsigned long i = 0; i < SIZE; ++i)
                tmp += p[i];
        getchar(); /* Pause */
        return 0;
}

Set up an 8G non-ramdisk swap: the first run of the program will swap out 8G of RAM successfully. But if the same program is run again while the first run is paused, the second run can't swap out all 8G of memory, as half of the swap device is now pinned by HAS_CACHE. There was a random scan in the old allocator that might reclaim part of the HAS_CACHE by luck, but it was unreliable.

The new allocator added reclaim of full clusters when the device is low on usable slots. But when multiple CPUs see that the device is low on usable slots at the same time, they run into a thundering herd problem. This is an observable problem on a large machine with a massively parallel workload, as full cluster reclaim is slower on a large swap device and a higher number of CPUs makes things worse.

Testing used a 128G ZRAM on a 48c96t system. When the swap device is very close to full (e.g. 124G / 128G), running a Linux kernel build with make -j96 in a 1G memory cgroup will hang (not a softlockup, though), spinning in full cluster reclaim for about ~5min before going OOM.

To solve this, split the full reclaim into two parts:

- Instead of doing a synchronous aggressive reclaim when the device is low, do only one aggressive reclaim when the device is strictly full, from a kworker. This still ensures that in the worst case the device won't be unusable because of HAS_CACHE slots.

- To avoid allocations (especially higher order) suffering from HAS_CACHE filling up clusters while the kworker is not responsive enough, do one synchronous scan every time the free list is drained, and only scan one cluster. This is similar to the random reclaim before: it keeps the full clusters rotated and has minimal latency. This should provide a fair reclaim strategy suitable for most workloads.

Link: https://lkml.kernel.org/r/20241022175512.10398-1-ryncsn@gmail.com
Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim")
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Barry Song
|
b54e1bfecc |
mm: fix PSWPIN counter for large folios swap-in
Similar to PSWPOUT, we should count the number of base pages instead of large folios. Link: https://lkml.kernel.org/r/20241023210201.2798-1-21cnbao@gmail.com Fixes: 242d12c98174 ("mm: support large folios swap-in for sync io devices") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
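In other words, the event counter should be bumped by the folio's page count rather than by one per folio. A hedged sketch of the pattern (the surrounding swap read-completion path is omitted):

#include <linux/mm.h>
#include <linux/vmstat.h>

/* Sketch: count base pages for PSWPIN, mirroring what PSWPOUT already does. */
static inline void count_swpin(struct folio *folio)
{
        count_vm_events(PSWPIN, folio_nr_pages(folio));
}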
||
Zi Yan
|
e0fc203748 |
mm: avoid VM_BUG_ON when try to map an anon large folio to zero page.
An anonymous large folio can be split into non order-0 folios, try_to_map_unused_to_zeropage() should not VM_BUG_ON compound pages but just return false. This fixes the crash when splitting anonymous large folios to non order-0 folios. Link: https://lkml.kernel.org/r/20241023171236.1122535-1-ziy@nvidia.com Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
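The shape of the fix is to turn the assertion into an early bail-out: a compound page cannot be remapped to the shared zeropage, so the helper simply reports failure. A simplified sketch of that check (names are illustrative, not the exact mm code):

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Sketch: bail out instead of VM_BUG_ON() when handed a compound page. */
static bool can_map_to_zeropage(struct page *page)
{
        if (PageCompound(page))
                return false;   /* previously an unconditional VM_BUG_ON_PAGE() */
        return true;
}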
||
John Hubbard
|
aa6f8b2593 |
mm/gup: stop leaking pinned pages in low memory conditions
If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) family of functions, and requests "too many" pages, then the call will erroneously leave pages pinned. This is visible in user space as an actual memory leak. Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls to exhaust memory. The root cause of the problem is this sequence, within __gup_longterm_locked(): __get_user_pages_locked() rc = check_and_migrate_movable_pages() ...which gets retried in a loop. The loop error handling is incomplete, clearly due to a somewhat unusual and complicated tri-state error API. But anyway, if -ENOMEM, or in fact, any unexpected error is returned from check_and_migrate_movable_pages(), then __gup_longterm_locked() happily returns the error, while leaving the pages pinned. In the failed case, which is an app that requests (via a device driver) 30720000000 bytes to be pinned, and then exits, I see this: $ grep foll /proc/vmstat nr_foll_pin_acquired 7502048 nr_foll_pin_released 2048 And after applying this patch, it returns to balanced pins: $ grep foll /proc/vmstat nr_foll_pin_acquired 7502048 nr_foll_pin_released 7502048 Note that the child routine, check_and_migrate_movable_folios(), avoids this problem, by unpinning any folios in the **folios argument, before returning an error. Fix this by making check_and_migrate_movable_pages() behave in exactly the same way as check_and_migrate_movable_folios(): unpin all pages in **pages, before returning an error. Also, documentation was an aggravating factor, so: 1) Consolidate the documentation for these two routines, now that they have identical external behavior. 2) Rewrite the consolidated documentation: a) Clearly list the three return code cases, and what happens in each case. b) Mention that one of the cases unpins the pages or folios, before returning an error code. Link: https://lkml.kernel.org/r/20241018223411.310331-1-jhubbard@nvidia.com Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
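Conceptually, the fix makes the pages-based path mirror the folios-based one: on any unexpected error from the migration check, unpin everything that was pinned before propagating the error. A hedged sketch of that error handling, with an illustrative helper name:

#include <linux/mm.h>

/* Sketch: on failure, drop the pins we took before returning the error. */
static long longterm_pin_finish(struct page **pages, long nr_pinned, int rc)
{
        if (rc < 0) {
                unpin_user_pages(pages, nr_pinned);     /* avoid leaking the pins */
                return rc;
        }
        return nr_pinned;
}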
||
Linus Torvalds
|
7fbaacafbc |
slab fixes for 6.12-rc6
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmcgrxcACgkQu+CwddJF iJrq9ggAiZ/2c7p23s52LdVhT9GTyV5omVOh2kDztVx4w6RM3RbkhkLWdqt0XUag uf1TJe6kOvnCeHEFEEo3sqPj820XebxKDf0GGCdI6a9f4n30ipKH+vWSQ0iutKO/ dOBdArxr0FGOV5VZR9i3xQ6sUqZXXUbJdte0c0ovp6Q6HDHTeQeKNhOQ2fv33TG/ 7jBh5HVyhI6JE/+TOxrMaklH0IqYBb6z49wdbaN7XBvXVXlb5MtOZy109gfUHDwe tfktifyE45VtmF0WdHfxDbCnqyDSG1Jm3wsLDbMq+voJ1BQlUvIZ5Dv4kucYqffm VN5HkH6uQ09aoounBoU4g50UYeNpiQ== =xAw8 -----END PGP SIGNATURE----- Merge tag 'slab-for-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - Fix for a slub_kunit test warning with MEM_ALLOC_PROFILING_DEBUG (Pei Xiao) - Fix for a MTE-based KASAN BUG in krealloc() (Qun-Wei Lin) * tag 'slab-for-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm: krealloc: Fix MTE false alarm in __do_krealloc slub/kunit: fix a WARNING due to unwrapped __kmalloc_cache_noprof |
||
Christoph Lameter
|
f7c80fad6c |
SLUB: Add support for per object memory policies
The old SLAB allocator used to support memory policies on a per-allocation basis. In SLUB the memory policies are applied on a per page frame / folio basis. Doing so avoids having to check memory policies in critical code paths for kmalloc and friends.

This worked generally well on Intel/AMD/PowerPC because the interconnect technology is mature and can minimize the latencies through intelligent caching even if a small object is not placed optimally.

However, on ARM we have an emergence of new NUMA interconnect technology based more on embedded devices. Caching of remote content can currently be ineffective using the standard building blocks / mesh available on that platform. Such architectures benefit if each slab object is individually placed according to memory policies and other restrictions.

This patch adds another kernel parameter, slab_strict_numa. If that is set then a static branch is activated that will cause the hotpaths of the allocator to evaluate the current memory allocation policy. Each object will be properly placed by paying the price of extra processing and SLUB will no longer defer to the page allocator to apply memory policies at the folio level.

This patch improves performance of memcached running on an Ampere Altra 2P system (ARM Neoverse N1 processor) by 3.6% due to accurate placement of small kernel objects.

Tested-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
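The mechanism is a static branch that stays off unless slab_strict_numa is passed on the kernel command line, so the fast path pays nothing by default. A hedged sketch of the gate (the real slub.c hot path is more involved, and the key name here is an assumption):

#include <linux/jump_label.h>
#include <linux/mempolicy.h>
#include <linux/numa.h>

/* Off by default; flipped by the slab_strict_numa boot parameter. */
static DEFINE_STATIC_KEY_FALSE(strict_numa);

static inline int slab_alloc_node_hint(int node)
{
        /* Only pay for the policy lookup when the admin asked for it. */
        if (static_branch_unlikely(&strict_numa) && node == NUMA_NO_NODE)
                node = mempolicy_slab_node();
        return node;
}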
||
Vlastimil Babka
|
b6da940130 |
mm, slab: add kerneldocs for common SLAB_ flags
We have many SLAB_ flags, but many are used only internally, by kunit tests or debugging subsystems cooperating with slab, or are set according to the slab_debug boot parameter.

Create kerneldocs for the commonly used flags that may be passed to kmem_cache_create(). SLAB_TYPESAFE_BY_RCU already had a detailed description, so turn it into a kerneldoc. Add some details for SLAB_ACCOUNT, SLAB_RECLAIM_ACCOUNT and SLAB_HWCACHE_ALIGN. Reference them from the __kmem_cache_create_args() kerneldoc.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
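For readers unfamiliar with the convention, a kerneldoc for a flag is just a structured comment above its definition that scripts/kernel-doc picks up. An illustrative (not verbatim) example of the style for one such flag:

/**
 * SLAB_ACCOUNT - Account allocations from this cache to kmemcg.
 *
 * Objects allocated from a cache created with this flag are charged to
 * the memory cgroup of the allocating task.  It only has an effect when
 * kernel memory accounting is enabled.
 */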
||
Zhen Lei
|
b4b797d877 |
mm/slab: remove duplicate check in create_cache()
The WARN_ON() check in static function create_cache() is done by its only parent __kmem_cache_create_args() before calling it. if (... || WARN_ON(... || object_size - args->usersize < args->useroffset)) args->usersize = args->useroffset = 0; ... s = create_cache(cache_name, object_size, args, flags); Therefore, the WARN_ON() check in create_cache() can be safely removed. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
Feng Tang
|
1e4df1859e |
mm/slub: Move krealloc() and related code to slub.c
This is a preparation for the following refactoring of krealloc(): moving the code enables more efficient function calls, as krealloc() will call some internal functions defined in slub.c.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
Feng Tang
|
fb5eda0dfe |
mm/kasan: Don't store metadata inside kmalloc object when slub_debug_orig_size is on
For a kmalloc object, when both kasan and the slub redzone sanity check are enabled, they could both manipulate its data space, e.g. by storing kasan free meta data and setting up the kmalloc redzone, and this may affect the accuracy of that object's 'orig_size'.

As an accurate 'orig_size' will be needed by functions like krealloc() soon, save kasan's free meta data in slub's metadata area instead of inside the object when 'orig_size' is enabled. This will make it easier to maintain/understand the code. Size-wise, when these two options are both enabled, the slub metadata space is already huge, and this only slightly increases the overall size.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
Qun-Wei Lin
|
704573851b |
mm: krealloc: Fix MTE false alarm in __do_krealloc
This patch addresses an issue introduced by commit 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") which causes MTE (Memory Tagging Extension) to falsely report a slab-out-of-bounds error. The problem occurs when zeroing out spare memory in __do_krealloc. The original code only considered software-based KASAN and did not account for MTE. It does not reset the KASAN tag before calling memset, leading to a mismatch between the pointer tag and the memory tag, resulting in a false positive. Example of the error: ================================================================== swapper/0: BUG: KASAN: slab-out-of-bounds in __memset+0x84/0x188 swapper/0: Write at addr f4ffff8005f0fdf0 by task swapper/0/1 swapper/0: Pointer tag: [f4], memory tag: [fe] swapper/0: swapper/0: CPU: 4 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12. swapper/0: Hardware name: MT6991(ENG) (DT) swapper/0: Call trace: swapper/0: dump_backtrace+0xfc/0x17c swapper/0: show_stack+0x18/0x28 swapper/0: dump_stack_lvl+0x40/0xa0 swapper/0: print_report+0x1b8/0x71c swapper/0: kasan_report+0xec/0x14c swapper/0: __do_kernel_fault+0x60/0x29c swapper/0: do_bad_area+0x30/0xdc swapper/0: do_tag_check_fault+0x20/0x34 swapper/0: do_mem_abort+0x58/0x104 swapper/0: el1_abort+0x3c/0x5c swapper/0: el1h_64_sync_handler+0x80/0xcc swapper/0: el1h_64_sync+0x68/0x6c swapper/0: __memset+0x84/0x188 swapper/0: btf_populate_kfunc_set+0x280/0x3d8 swapper/0: __register_btf_kfunc_id_set+0x43c/0x468 swapper/0: register_btf_kfunc_id_set+0x48/0x60 swapper/0: register_nf_nat_bpf+0x1c/0x40 swapper/0: nf_nat_init+0xc0/0x128 swapper/0: do_one_initcall+0x184/0x464 swapper/0: do_initcall_level+0xdc/0x1b0 swapper/0: do_initcalls+0x70/0xc0 swapper/0: do_basic_setup+0x1c/0x28 swapper/0: kernel_init_freeable+0x144/0x1b8 swapper/0: kernel_init+0x20/0x1a8 swapper/0: ret_from_fork+0x10/0x20 ================================================================== Fixes: 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") Signed-off-by: Qun-Wei Lin <qun-wei.lin@mediatek.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
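The fix amounts to resetting the pointer's KASAN/MTE tag before zeroing the spare region, so the memset() goes through a match-all pointer and no tag-check fault fires. A hedged sketch of the pattern, with the real __do_krealloc() bookkeeping omitted:

#include <linux/kasan.h>
#include <linux/string.h>

/* Sketch: zero spare space through a tag-reset pointer so MTE stays quiet. */
static void zero_spare_space(void *p, size_t used, size_t allocated)
{
        if (allocated > used)
                memset(kasan_reset_tag(p) + used, 0, allocated - used);
}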
||
Barry Song
|
01626a1823 |
mm: avoid unconditional one-tick sleep when swapcache_prepare fails
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") introduced an unconditional one-tick sleep when `swapcache_prepare()` fails, which has led to reports of UI stuttering on latency-sensitive Android devices. To address this, we can use a waitqueue to wake up tasks that fail `swapcache_prepare()` sooner, instead of always sleeping for a full tick. While tasks may occasionally be woken by an unrelated `do_swap_page()`, this method is preferable to two scenarios: rapid re-entry into page faults, which can cause livelocks, and multiple millisecond sleeps, which visibly degrade user experience. Oven's testing shows that a single waitqueue resolves the UI stuttering issue. If a 'thundering herd' problem becomes apparent later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit locks can be introduced. [v-songbaohua@oppo.com: wake_up only when swapcache_wq waitqueue is active] Link: https://lkml.kernel.org/r/20241008130807.40833-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240926211936.75373-1-21cnbao@gmail.com Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reported-by: Oven Liyang <liyangouwen1@oppo.com> Tested-by: Oven Liyang <liyangouwen1@oppo.com> Cc: Kairui Song <kasong@tencent.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Yu Zhao <yuzhao@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
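The replacement for the fixed one-tick sleep is a plain waitqueue: losers of the swapcache_prepare() race sleep on it with a bounded timeout, and the winner wakes them when it finishes, checking waitqueue_active() first so the common case stays wake-free. A hedged sketch of that pattern, detached from the real do_swap_page() context:

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);

/* Loser of the race: wait on the queue instead of always sleeping a tick. */
static void wait_for_other_swapin(void)
{
        DEFINE_WAIT(wait);

        prepare_to_wait(&swapcache_wq, &wait, TASK_UNINTERRUPTIBLE);
        schedule_timeout(1);    /* at most one tick; a wake_up ends it earlier */
        finish_wait(&swapcache_wq, &wait);
}

/* Winner: wake waiters only if somebody is actually queued. */
static void swapin_done(void)
{
        if (waitqueue_active(&swapcache_wq))
                wake_up(&swapcache_wq);
}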
||
Kirill A. Shutemov
|
58a039e679 |
mm: split critical region in remap_file_pages() and invoke LSMs in between
Commit ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") fixed a security issue, it added an LSM check when trying to remap file pages, so that LSMs have the opportunity to evaluate such action like for other memory operations such as mmap() and mprotect(). However, that commit called security_mmap_file() inside the mmap_lock lock, while the other calls do it before taking the lock, after commit 8b3ec6814c83 ("take security_mmap_file() outside of ->mmap_sem"). This caused lock inversion issue with IMA which was taking the mmap_lock and i_mutex lock in the opposite way when the remap_file_pages() system call was called. Solve the issue by splitting the critical region in remap_file_pages() in two regions: the first takes a read lock of mmap_lock, retrieves the VMA and the file descriptor associated, and calculates the 'prot' and 'flags' variables; the second takes a write lock on mmap_lock, checks that the VMA flags and the VMA file descriptor are the same as the ones obtained in the first critical region (otherwise the system call fails), and calls do_mmap(). In between, after releasing the read lock and before taking the write lock, call security_mmap_file(), and solve the lock inversion issue. Link: https://lkml.kernel.org/r/20241018161415.3845146-1-roberto.sassu@huaweicloud.com Fixes: ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reported-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-security-module/66f7b10e.050a0220.46d20.0036.GAE@google.com/ Tested-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Tested-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> Cc: Eric Snowberg <eric.snowberg@oracle.com> Cc: James Morris <jmorris@namei.org> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Shu Han <ebpqwerty472123@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
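The structure of the fix is a classic unlock-check-relock pattern: snapshot what is needed under the read lock, drop it, call the LSM hook locklessly, then retake the lock for writing and verify nothing relevant changed before proceeding. A heavily simplified, hedged sketch of that control flow (not the actual sys_remap_file_pages() code; error handling and helpers are illustrative):

#include <linux/file.h>
#include <linux/mm.h>
#include <linux/security.h>

static int remap_two_phase(struct mm_struct *mm, unsigned long start,
                           unsigned long prot)
{
        struct vm_area_struct *vma;
        struct file *file;
        vm_flags_t saved_flags;
        int ret;

        /* Phase 1: read lock, snapshot the VMA state we depend on. */
        mmap_read_lock(mm);
        vma = vma_lookup(mm, start);
        if (!vma || !vma->vm_file) {
                mmap_read_unlock(mm);
                return -EINVAL;
        }
        file = get_file(vma->vm_file);
        saved_flags = vma->vm_flags;
        mmap_read_unlock(mm);

        /* LSM hook with no mmap_lock held, avoiding the lock inversion. */
        ret = security_mmap_file(file, prot, 0);
        if (ret)
                goto out_fput;

        /* Phase 2: write lock, fail if the VMA changed underneath us. */
        if (mmap_write_lock_killable(mm)) {
                ret = -EINTR;
                goto out_fput;
        }
        vma = vma_lookup(mm, start);
        if (!vma || vma->vm_file != file || vma->vm_flags != saved_flags)
                ret = -EINVAL;
        else
                ret = 0;        /* the real code calls do_mmap() here */
        mmap_write_unlock(mm);

out_fput:
        fput(file);
        return ret;
}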
||
Lorenzo Stoakes
|
c4d91e225f |
mm/vma: add expand-only VMA merge mode and optimise do_brk_flags()
Patch series "introduce VMA merge mode to improve brk() performance". A ~5% performance regression was discovered on the aim9.brk_test.ops_per_sec by the linux kernel test bot [0]. In the past to satisfy brk() performance we duplicated VMA expansion code and special-cased do_brk_flags(). This is however horrid and undoes work to abstract this logic, so in resolving the issue I have endeavoured to avoid this. Investigating further I was able to observe that the use of a vma_iter_next_range() and vma_prev() pair, causing an unnecessary maple tree walk. In addition there is work that we do that is simply unnecessary for brk(). Therefore, add a special VMA merge mode VMG_FLAG_JUST_EXPAND to avoid doing any of this - it assumes the VMA iterator is pointing at the previous VMA and which skips logic that brk() does not require. This mostly eliminates the performance regression reducing it to ~2% which is in the realm of noise. In addition, the will-it-scale test brk2, written to be more representative of real-world brk() usage, shows a modest performance improvement - which gives me confidence that we are not meaningfully regressing real workloads here. This series includes a test asserting that the 'just expand' mode works as expected. With many thanks to Oliver Sang for helping with performance testing of candidate patch sets! [0]:https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com This patch (of 2): We know in advance that do_brk_flags() wants only to perform a VMA expansion (if the prior VMA is compatible), and that we assume no mergeable VMA follows it. These are the semantics of this function prior to the recent rewrite of the VMA merging logic, however we are now doing more work than necessary - positioning the VMA iterator at the prior VMA and performing tasks that are not required. Add a new field to the vmg struct to permit merge flags and add a new merge flag VMG_FLAG_JUST_EXPAND which implies this behaviour, and have do_brk_flags() use this. This fixes a reported performance regression in a brk() benchmarking suite. Link: https://lkml.kernel.org/r/cover.1729174352.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/4e65d4395e5841c5acf8470dbcb714016364fd39.1729174352.git.lorenzo.stoakes@oracle.com Fixes: cacded5e42b9 ("mm: avoid using vma_merge() for new VMAs") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Nobuhiro Iwamatsu
|
d95fb348f0 |
mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id
The memory blocks acquired for the reserved regions may include blocks outside of memory management. In this case, the nid variable is set to NUMA_NO_NODE (-1), so an error occurs in node_set().

This adds a check using numa_valid_node() to numa_clear_kernel_node_hotplug() that skips node_set() when nid is set to NUMA_NO_NODE.

Link: https://lkml.kernel.org/r/1729070461-13576-1-git-send-email-nobuhiro1.iwamatsu@toshiba.co.jp
Fixes: 87482708210f ("mm: introduce numa_memblks")
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Yuji Ishikawa <yuji2.ishikawa@toshiba.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
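The guard itself is tiny: validate the node id before feeding it to node_set(), since NUMA_NO_NODE (-1) is legitimate for reserved blocks outside memory management. A hedged sketch of the pattern (helper and mask names are illustrative):

#include <linux/nodemask.h>
#include <linux/numa.h>

/* Sketch: only record node ids that are actually valid. */
static void record_reserved_nid(int nid, nodemask_t *mask)
{
        if (!numa_valid_node(nid))      /* NUMA_NO_NODE (-1) for unmanaged blocks */
                return;
        node_set(nid, *mask);
}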
||
Jeongjun Park
|
d949d1d14f |
mm: shmem: fix data-race in shmem_getattr()
I got the following KCSAN report during syzbot testing:

==================================================================
BUG: KCSAN: data-race in generic_fillattr / inode_set_ctime_current

write to 0xffff888102eb3260 of 4 bytes by task 6565 on cpu 1:
 inode_set_ctime_to_ts include/linux/fs.h:1638 [inline]
 inode_set_ctime_current+0x169/0x1d0 fs/inode.c:2626
 shmem_mknod+0x117/0x180 mm/shmem.c:3443
 shmem_create+0x34/0x40 mm/shmem.c:3497
 lookup_open fs/namei.c:3578 [inline]
 open_last_lookups fs/namei.c:3647 [inline]
 path_openat+0xdbc/0x1f00 fs/namei.c:3883
 do_filp_open+0xf7/0x200 fs/namei.c:3913
 do_sys_openat2+0xab/0x120 fs/open.c:1416
 do_sys_open fs/open.c:1431 [inline]
 __do_sys_openat fs/open.c:1447 [inline]
 __se_sys_openat fs/open.c:1442 [inline]
 __x64_sys_openat+0xf3/0x120 fs/open.c:1442
 x64_sys_call+0x1025/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:258
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

read to 0xffff888102eb3260 of 4 bytes by task 3498 on cpu 0:
 inode_get_ctime_nsec include/linux/fs.h:1623 [inline]
 inode_get_ctime include/linux/fs.h:1629 [inline]
 generic_fillattr+0x1dd/0x2f0 fs/stat.c:62
 shmem_getattr+0x17b/0x200 mm/shmem.c:1157
 vfs_getattr_nosec fs/stat.c:166 [inline]
 vfs_getattr+0x19b/0x1e0 fs/stat.c:207
 vfs_statx_path fs/stat.c:251 [inline]
 vfs_statx+0x134/0x2f0 fs/stat.c:315
 vfs_fstatat+0xec/0x110 fs/stat.c:341
 __do_sys_newfstatat fs/stat.c:505 [inline]
 __se_sys_newfstatat+0x58/0x260 fs/stat.c:499
 __x64_sys_newfstatat+0x55/0x70 fs/stat.c:499
 x64_sys_call+0x141f/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:263
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

value changed: 0x2755ae53 -> 0x27ee44d3

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 3498 Comm: udevd Not tainted 6.11.0-rc6-syzkaller-00326-gd1f2d51b711a-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
==================================================================

When generic_fillattr() is called without holding a read lock, a data race on inode member variables can occur, which can cause unexpected behavior.

Since there is no special protection when shmem_getattr() calls generic_fillattr(), a data race can be triggered by functions such as shmem_unlink() or shmem_mknod(). This can cause unexpected results, so commenting it out is not enough.

Therefore, when calling generic_fillattr() from shmem_getattr(), it is appropriate to protect the inode using inode_lock_shared() and inode_unlock_shared() to prevent the data race.

Link: https://lkml.kernel.org/r/20240909123558.70229-1-aha310510@gmail.com
Fixes: 44a30220bc0a ("shmem: recalculate file inode when fstat")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Reported-by: syzbot <syzkaller@googlegroup.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
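The getattr side of the fix is essentially a shared-lock bracket around the attribute copy. A hedged sketch of that pattern (the real shmem_getattr() also handles request_mask and shmem-specific fields):

#include <linux/fs.h>
#include <linux/stat.h>

/* Sketch: take the shared inode lock around generic_fillattr(). */
static void fill_attr_locked(struct mnt_idmap *idmap, u32 request_mask,
                             struct inode *inode, struct kstat *stat)
{
        inode_lock_shared(inode);
        generic_fillattr(idmap, request_mask, inode, stat);
        inode_unlock_shared(inode);
}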
||
Jann Horn
|
14611508cb |
mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAIL
vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs have already been torn down halfway (in a way we can't undo) but are still present in the maple tree. At this point, we *must* remove the VMAs from the VMA tree, otherwise we get UAF. Because removing VMA tree nodes can require memory allocation, the existing code has an error path which tries to handle this by reattaching the VMAs; but that can't be done safely. A nicer way to fix it would probably be to preallocate enough maple tree nodes for the removal before the point of no return, or something like that; but for now, fix it the easy and kinda ugly way, by marking this allocation __GFP_NOFAIL. Link: https://lkml.kernel.org/r/20241016-fix-munmap-abort-v1-1-601c94b2240d@google.com Fixes: 4f87153e82c4 ("mm: change failure of MAP_FIXED to restoring the gap on failure") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
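The "easy and kinda ugly way" is just adding __GFP_NOFAIL to the maple tree store that wipes the half-removed VMAs, so the store cannot fail and no unsafe reattach path is needed. A hedged sketch of the store (the range and tree setup are illustrative, not the exact mm/vma.c code):

#include <linux/gfp.h>
#include <linux/maple_tree.h>

/* Sketch: wipe a range from the tree; the abort path cannot tolerate failure. */
static void wipe_range_nofail(struct ma_state *mas, unsigned long start,
                              unsigned long last)
{
        mas_set_range(mas, start, last);
        mas_store_gfp(mas, NULL, GFP_KERNEL | __GFP_NOFAIL);
}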
||
Huang Ying
|
b7c5f9a1fb |
resource: remove dependency on SPARSEMEM from GET_FREE_REGION
We want to use the functions (get_free_mem_region()) configured via GET_FREE_REGION in resource kunit tests. However, GET_FREE_REGION depends on SPARSEMEM now. This means the resource kunit tests cannot be built on some architectures lacking SPARSEMEM, or causes a config warning as follows:

WARNING: unmet direct dependencies detected for GET_FREE_REGION
  Depends on [n]: SPARSEMEM [=n]
  Selected by [y]:
  - RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y]

When get_free_mem_region() was introduced the only consumers were those looking to pass the address range to memremap_pages(). That address range needed to be mindful of the maximum addressable platform physical address, which at the time only SPARSEMEM defined via MAX_PHYSMEM_BITS.

Given that memremap_pages() also depended on SPARSEMEM via ZONE_DEVICE, it was easier to just depend on that definition than invent a general MAX_PHYSMEM_BITS concept outside of SPARSEMEM.

Turns out that decision was buggy and did not account for KASAN consumption of physical address space. That problem was resolved recently with commit ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space"), and GET_FREE_REGION dropped its MAX_PHYSMEM_BITS dependency.

Then commit 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") went ahead and fixed up the only remaining dependency on SPARSEMEM, which was usage of the PA_SECTION_SHIFT macro for setting the default alignment. A PAGE_SIZE fallback is fine in the SPARSEMEM=n case.

With those build dependencies gone, GET_FREE_REGION no longer depends on SPARSEMEM. So, this patch removes the dependency on SPARSEMEM from GET_FREE_REGION to fix the build issues.

Link: https://lkml.kernel.org/r/20241016014730.339369-1-ying.huang@intel.com
Link: https://lore.kernel.org/lkml/20240922225041.603186-1-linux@roeck-us.net/
Link: https://lkml.kernel.org/r/20241015051554.294734-1-ying.huang@intel.com
Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Nathan Chancellor <nathan@kernel.org> # build
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
79f3d123ca |
mm/mmap: fix race in mmap_region() with ftruncate()
Avoiding the zeroing of the vma tree in mmap_region() introduced a race with truncate in the page table walk. To avoid any races, create a hole in the rmap during the operation by clearing the pagetable entries earlier under the mmap write lock and (critically) before the new vma is installed into the vma tree.

The result is that the old vma(s) are left in the vma tree, but free_pgtables() removes them from the rmap and clears the ptes while holding the necessary locks.

This change extends the fix required for hugetlbfs and the call_mmap() function by moving the cleanup higher in the function and running it unconditionally.

Link: https://lkml.kernel.org/r/20241016013455.2241533-1-Liam.Howlett@oracle.com
Fixes: f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |