20669 Commits

Author SHA1 Message Date
Huang Ying
3ecdeb0f87 swap: remove __swp_swapcount()
__swp_swapcount() just encloses the calling to swap_swapcount() with
get/put_swap_device().  It is called in __read_swap_cache_async() only,
which encloses the calling with get/put_swap_device() already.  So,
__read_swap_cache_async() can call swap_swapcount() directly.

Link: https://lkml.kernel.org/r/20230529061355.125791-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:49 -07:00
Huang Ying
46a774d3ea swap, __read_swap_cache_async(): enlarge get/put_swap_device protection range
This makes the function a little easier to be understood because we don't
need to consider swapoff.  And this makes it possible to remove
get/put_swap_device() calling in some functions called by
__read_swap_cache_async().

Link: https://lkml.kernel.org/r/20230529061355.125791-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:49 -07:00
Huang Ying
f9f956b550 swap: remove get/put_swap_device() in __swap_count()
Patch series "swap: cleanup get/put_swap_device() usage", v3.

The general rule to use a swap entry is as follows.

When we get a swap entry, if there aren't some other ways to prevent
swapoff, such as the folio in swap cache is locked, page table lock is
held, etc., the swap entry may become invalid because of swapoff.  Then,
we need to enclose all swap related functions with get_swap_device() and
put_swap_device(), unless the swap functions call get/put_swap_device() by
themselves.

Based on the above rule, all get/put_swap_device() usage are checked and
cleaned up if necessary.


This patch (of 5):

get/put_swap_device() are added to __swap_count() in commit
eb085574a752 ("mm, swap: fix race between swapoff and some swap
operations").  Later, in commit 2799e77529c2 ("swap: fix
do_swap_page() race with swapoff"), get/put_swap_device() are added to
do_swap_page().  And they enclose the only call site of
__swap_count().  So, it's safe to remove get/put_swap_device() in
__swap_count() now.

Link: https://lkml.kernel.org/r/20230529061355.125791-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230529061355.125791-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:49 -07:00
Haifeng Xu
1c2d252f5b mm/mm_init.c: do not calculate zone_start_pfn/zone_end_pfn in zone_absent_pages_in_node()
In calculate_node_totalpages(), zone_start_pfn/zone_end_pfn are already
calculated in zone_spanned_pages_in_node(), so use them as parameters
instead of node_start_pfn/node_end_pfn and the duplicated calculation
process can de dropped.

Link: https://lkml.kernel.org/r/20230526085251.1977-2-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Haifeng Xu <haifeng.xu@shopee.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:49 -07:00
Haifeng Xu
ba1b67c79c mm/mm_init.c: introduce reset_memoryless_node_totalpages()
Currently, no matter whether a node actually has memory or not,
calculate_node_totalpages() is used to account number of pages in
zone/node.  However, for node without memory, these unnecessary
calculations can be skipped.  All the zone/node page counts can be set to
0 directly.  So introduce reset_memoryless_node_totalpages() to perform
this action.

Furthermore, calculate_node_totalpages() only gets called for the node
with memory.

Link: https://lkml.kernel.org/r/20230526085251.1977-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:48 -07:00
Kalesh Singh
3af0191a59 Multi-gen LRU: fix workingset accounting
On Android app cycle workloads, MGLRU showed a significant reduction in
workingset refaults although pgpgin/pswpin remained relatively unchanged. 
This indicated MGLRU may be undercounting workingset refaults.

This has impact on userspace programs, like Android's LMKD, that monitor
workingset refault statistics to detect thrashing.

It was found that refaults were only accounted if the MGLRU shadow entry
was for a recently evicted folio.  However, recently evicted folios should
be accounted as workingset activation, and refaults should be accounted
regardless of recency.

Fix MGLRU's workingset refault and activation accounting to more closely
match that of the conventional active/inactive LRU.

Link: https://lkml.kernel.org/r/20230523205922.3852731-1-kaleshsingh@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:46 -07:00
Lars R. Damerow
e0e0b4126c mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcg
This patch is similar to commit 8e20d4b33266 ("mm/memcontrol: export
memcg->watermark via sysfs for v2 memcg"), but exports the swap counter's
watermark.

We allocate jobs to our compute farm using heuristics determined by memory
and swap usage from previous jobs.  Tracking the peak swap usage for new
jobs is important for determining when jobs are exceeding their expected
bounds, or when our baseline metrics are getting outdated.

Our toolset was written to use the "memory.memsw.max_usage_in_bytes" file
in cgroups v1, and altering it to poll cgroups v2's "memory.swap.current"
would give less accurate results as well as add complication to the code. 
Having this watermark exposed in sysfs is much preferred.

Link: https://lkml.kernel.org/r/20230524181734.125696-1-lars@pixar.com
Signed-off-by: Lars R. Damerow <lars@pixar.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:43 -07:00
Tu Jinjiang
283ebdee2d mm: shmem: fix UAF bug in shmem_show_options()
shmem_show_options() uses sbinfo->mpol without adding it's refcnt. This
may lead to race with replacement of the mpol by remount. The execution
sequence is as follows.

       CPU0                                   CPU1
shmem_show_options()                        shmem_reconfigure()
    shmem_show_mpol(seq, sbinfo->mpol)          mpol = sbinfo->mpol
                                                mpol_put(mpol)
        mpol->mode

The KASAN report is as follows.

BUG: KASAN: slab-use-after-free in shmem_show_options+0x21b/0x340
Read of size 2 at addr ffff888124324004 by task mount/2388

CPU: 2 PID: 2388 Comm: mount Not tainted 6.4.0-rc3-00017-g9d646009f65d-dirty #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x37/0x50
 print_report+0xd0/0x620
 ? shmem_show_options+0x21b/0x340
 ? __virt_addr_valid+0xf4/0x180
 ? shmem_show_options+0x21b/0x340
 kasan_report+0xb8/0xe0
 ? shmem_show_options+0x21b/0x340
 shmem_show_options+0x21b/0x340
 ? __pfx_shmem_show_options+0x10/0x10
 ? strchr+0x2c/0x50
 ? strlen+0x23/0x40
 ? seq_puts+0x7d/0x90
 show_vfsmnt+0x1e6/0x260
 ? __pfx_show_vfsmnt+0x10/0x10
 ? __kasan_kmalloc+0x7f/0x90
 seq_read_iter+0x57a/0x740
 vfs_read+0x2e2/0x4a0
 ? __pfx_vfs_read+0x10/0x10
 ? down_write_killable+0xb8/0x140
 ? __pfx_down_write_killable+0x10/0x10
 ? __fget_light+0xa9/0x1e0
 ? up_write+0x3f/0x80
 ksys_read+0xb8/0x150
 ? __pfx_ksys_read+0x10/0x10
 ? fpregs_assert_state_consistent+0x55/0x60
 ? exit_to_user_mode_prepare+0x2d/0x120
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

 </TASK>

Allocated by task 2387:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 __kasan_slab_alloc+0x59/0x70
 kmem_cache_alloc+0xdd/0x220
 mpol_new+0x83/0x150
 mpol_parse_str+0x280/0x4a0
 shmem_parse_one+0x364/0x520
 vfs_parse_fs_param+0xf8/0x1a0
 vfs_parse_fs_string+0xc9/0x130
 shmem_parse_options+0xb2/0x110
 path_mount+0x597/0xdf0
 do_mount+0xcd/0xf0
 __x64_sys_mount+0xbd/0x100
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 2389:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2e/0x50
 __kasan_slab_free+0x10e/0x1a0
 kmem_cache_free+0x9c/0x350
 shmem_reconfigure+0x278/0x370
 reconfigure_super+0x383/0x450
 path_mount+0xcc5/0xdf0
 do_mount+0xcd/0xf0
 __x64_sys_mount+0xbd/0x100
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

The buggy address belongs to the object at ffff888124324000
 which belongs to the cache numa_policy of size 32
The buggy address is located 4 bytes inside of
 freed 32-byte region [ffff888124324000, ffff888124324020)
==================================================================

To fix the bug, shmem_get_sbmpol() / mpol_put() needs to be called
before / after shmem_show_mpol() call.

Link: https://lkml.kernel.org/r/20230525031640.593733-1-tujinjiang@huawei.com
Signed-off-by: Tu Jinjiang <tujinjiang@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:43 -07:00
Baolin Wang
a8d13355c6 mm: compaction: skip fast freepages isolation if enough freepages are isolated
I've observed that fast isolation often isolates more pages than
cc->migratepages, and the excess freepages will be released back to the
buddy system.  So skip fast freepages isolation if enough freepages are
isolated to save some CPU cycles.

Link: https://lkml.kernel.org/r/f39c2c07f2dba2732fd9c0843572e5bef96f7f67.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:43 -07:00
Baolin Wang
447ba88658 mm: compaction: add trace event for fast freepages isolation
The fast_isolate_freepages() can also isolate freepages, but we can not
know the fast isolation efficiency to understand the fast isolation
pressure.  So add a trace event to show some numbers to help to understand
the efficiency for fast freepages isolation.

Link: https://lkml.kernel.org/r/78d2932d0160d122c15372aceb3f2c45460a17fc.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:43 -07:00
Baolin Wang
8b71b499ff mm: compaction: only set skip flag if cc->no_set_skip_hint is false
To keep the same logic as test_and_set_skip(), only set the skip flag if
cc->no_set_skip_hint is false, which makes code more reasonable.

Link: https://lkml.kernel.org/r/0eb2cd2407ffb259ae6e3071e10f70f2d41d0f3e.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:42 -07:00
Baolin Wang
cf650342f8 mm: compaction: skip more fully scanned pageblock
In fast_isolate_around(), it assumes the pageblock is fully scanned if
cc->nr_freepages < cc->nr_migratepages after trying to isolate some free
pages, and will set skip flag to avoid scanning in future.  However this
can miss setting the skip flag for a fully scanned pageblock (returned
'start_pfn' is equal to 'end_pfn') in the case where cc->nr_freepages is
larger than cc->nr_migratepages.

So using the returned 'start_pfn' from isolate_freepages_block() and
'end_pfn' to decide if a pageblock is fully scanned makes more sense.  It
can also cover the case where cc->nr_freepages < cc->nr_migratepages,
which means the 'start_pfn' is usually equal to 'end_pfn' except some
uncommon fatal error occurs after non-strict mode isolation.

Link: https://lkml.kernel.org/r/f4efd2fa08735794a6d809da3249b6715ba6ad38.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:42 -07:00
Baolin Wang
2dbd90054f mm: compaction: change fast_isolate_freepages() to void type
No caller cares about the return value of fast_isolate_freepages(), void
it.

Link: https://lkml.kernel.org/r/759fca20b22ebf4c81afa30496837b9e0fb2e53b.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:42 -07:00
Baolin Wang
75990f6459 mm: compaction: drop the redundant page validation in update_pageblock_skip()
Patch series "Misc cleanups and improvements for compaction".

This series cantains some cleanups and improvements for compaction.


This patch (of 6):

The caller has validated the page before calling
update_pageblock_skip(), thus drop the redundant page validation in
update_pageblock_skip().

Link: https://lkml.kernel.org/r/5142e15b9295fe8c447dbb39b7907a20177a1413.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:42 -07:00
Thomas Gleixner
77e50af07f mm/vmalloc: dont purge usable blocks unnecessarily
Purging fragmented blocks is done unconditionally in several contexts:

  1) From drain_vmap_area_work(), when the number of lazy to be freed
     vmap_areas reached the threshold

  2) Reclaiming vmalloc address space from pcpu_get_vm_areas()

  3) _vm_unmap_aliases()

#1 There is no reason to zap fragmented vmap blocks unconditionally, simply
   because reclaiming all lazy areas drains at least

      32MB * fls(num_online_cpus())

   per invocation which is plenty.

#2 Reclaiming when running out of space or due to memory pressure makes a
   lot of sense

#3 _unmap_aliases() requires to touch everything because the caller has no
   clue which vmap_area used a particular page last and the vmap_area lost
   that information too.

   Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
   vmap area first and then cares about the flush. That in turn requires
   a full walk of _all_ vmap areas including the one which was just
   added to the purge list.

   But as this has to be flushed anyway this is an opportunity to combine
   outstanding TLB flushes and do the housekeeping of purging freed areas,
   but like #1 there is no real good reason to zap usable vmap blocks
   unconditionally.

Add a @force_purge argument to the newly split out block purge function and
if not true only purge fragmented blocks which have less than 1/4 of their
capacity left.

Rename purge_vmap_area_lazy() to reclaim_and_purge_vmap_areas() to make it
clear what the function does.

[lstoakes@gmail.com: correct VMAP_PURGE_THRESHOLD check]
  Link: https://lkml.kernel.org/r/3e92ef61-b910-4576-88e7-cf43211fd4e7@lucifer.local
Link: https://lkml.kernel.org/r/20230525124504.864005691@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:41 -07:00
Thomas Gleixner
7f48121e9f mm/vmalloc: add missing READ/WRITE_ONCE() annotations
purge_fragmented_blocks() accesses vmap_block::free and vmap_block::dirty
lockless for a quick check.

Add the missing READ/WRITE_ONCE() annotations.

Link: https://lkml.kernel.org/r/20230525124504.807356682@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:41 -07:00
Thomas Gleixner
43d7650234 mm/vmalloc: check free space in vmap_block lockless
vb_alloc() unconditionally locks a vmap_block on the free list to check
the free space.

This can be done locklessly because vmap_block::free never increases, it's
only decreased on allocations.

Check the free space lockless and only if that succeeds, recheck under the
lock.

Link: https://lkml.kernel.org/r/20230525124504.750481992@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:41 -07:00
Thomas Gleixner
a09fad96ff mm/vmalloc: prevent flushing dirty space over and over
vmap blocks which have active mappings cannot be purged.  Allocations
which have been freed are accounted for in vmap_block::dirty_min/max, so
that they can be detected in _vm_unmap_aliases() as potentially stale
TLBs.

If there are several invocations of _vm_unmap_aliases() then each of them
will flush the dirty range.  That's pointless and just increases the
probability of full TLB flushes.

Avoid that by resetting the flush range after accounting for it.  That's
safe versus other invocations of _vm_unmap_aliases() because this is all
serialized with vmap_purge_lock.

Link: https://lkml.kernel.org/r/20230525124504.692056496@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:41 -07:00
Thomas Gleixner
ca5e46c340 mm/vmalloc: avoid iterating over per CPU vmap blocks twice
_vunmap_aliases() walks the per CPU xarrays to find partially unmapped
blocks and then walks the per cpu free lists to purge fragmented blocks.

Arguably that's waste of CPU cycles and cache lines as the full xarray
walk already touches every block.

Avoid this double iteration:

  - Split out the code to purge one block and the code to free the local
    purge list into helper functions.

  - Try to purge the fragmented blocks in the xarray walk before looking at
    their dirty space.

Link: https://lkml.kernel.org/r/20230525124504.633469722@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:40 -07:00
Thomas Gleixner
fc1e0d9800 mm/vmalloc: prevent stale TLBs in fully utilized blocks
Patch series "mm/vmalloc: Assorted fixes and improvements", v2.

this series addresses the following issues:

  1) Prevent the stale TLB problem related to fully utilized vmap blocks

  2) Avoid the double per CPU list walk in _vm_unmap_aliases()

  3) Avoid flushing dirty space over and over

  4) Add a lockless quickcheck in vb_alloc() and add missing
     READ/WRITE_ONCE() annotations

  5) Prevent overeager purging of usable vmap_blocks if
     not under memory/address space pressure.


This patch (of 6):

_vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
page are left in the system. This is required due to the lazy TLB flush
mechanism in vmalloc.

This is tried to achieve by walking the per CPU free lists, but those do
not contain fully utilized vmap blocks because they are removed from the
free list once the blocks free space became zero.

When the block is not fully unmapped then it is not on the purge list
either.

So neither the per CPU list iteration nor the purge list walk find the
block and if the page was mapped via such a block and the TLB has not yet
been flushed, the guarantee of _vm_unmap_aliases() that there are no stale
TLBs after returning is broken:

x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
vb_free(x)     // Unmaps page and marks in dirty_min/max range
	       // Block has still mappings and is not put on purge list

// Page is reused
vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL

So instead of walking the per CPU free lists, walk the per CPU xarrays
which hold pointers to _all_ active blocks in the system including those
removed from the free lists.

Link: https://lkml.kernel.org/r/20230525122342.109672430@linutronix.de
Link: https://lkml.kernel.org/r/20230525124504.573987880@linutronix.de
Fixes: db64fe02258f ("mm: rewrite vmap layer")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:40 -07:00
T.J. Alumbaugh
d7f1afd0e3 mm: multi-gen LRU: cleanup lru_gen_test_recent()
Avoid passing memcg* and pglist_data* to lru_gen_test_recent()
since we only use the lruvec anyway.

Link: https://lkml.kernel.org/r/20230522112058.2965866-4-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:40 -07:00
T.J. Alumbaugh
bd02df412c mm: multi-gen LRU: add helpers in page table walks
Add helpers to page table walking code:
 - Clarifies intent via name "should_walk_mmu" and "should_clear_pmd_young"
 - Avoids repeating same logic in two places

Link: https://lkml.kernel.org/r/20230522112058.2965866-3-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:40 -07:00
T.J. Alumbaugh
5c7e7a0d79 mm: multi-gen LRU: cleanup lru_gen_soft_reclaim()
lru_gen_soft_reclaim() gets the lruvec from the memcg and node ID to keep a
cleaner interface on the caller side.

Link: https://lkml.kernel.org/r/20230522112058.2965866-2-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:39 -07:00
T.J. Alumbaugh
0285762c6f mm: multi-gen LRU: use macro for bitmap
Use DECLARE_BITMAP macro when possible.

Link: https://lkml.kernel.org/r/20230522112058.2965866-1-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:39 -07:00
Haifeng Xu
08e0f49e99 mm/memcontrol: fix typo in comment
Replace 'then' with 'than'.

Link: https://lkml.kernel.org/r/20230522095233.4246-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:39 -07:00
Andrew Morton
b0cc5e89ca mm/mlock: rename mlock_future_check() to mlock_future_ok()
It is felt that the name mlock_future_check() is vague - it doesn't
particularly convey the function's operation.  mlock_future_ok() is a
clearer name for a predicate function.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:38 -07:00
Lorenzo Stoakes
3c54a298db mm/mmap: refactor mlock_future_check()
In all but one instance, mlock_future_check() is treated as a boolean
function despite returning an error code.  In one instance, this error
code is ignored and replaced with -ENOMEM.

This is confusing, and the inversion of true -> failure, false -> success
is not warranted.  Convert the function to a bool, lightly refactor and
return true if the check passes, false if not.

Link: https://lkml.kernel.org/r/20230522082412.56685-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:38 -07:00
Johannes Weiner
4fbbb3fde3 mm: compaction: avoid GFP_NOFS ABBA deadlock
During stress testing with higher-order allocations, a deadlock scenario
was observed in compaction: One GFP_NOFS allocation was sleeping on
mm/compaction.c::too_many_isolated(), while all CPUs in the system were
busy with compactors spinning on buffer locks held by the sleeping
GFP_NOFS allocation.

Reclaim is susceptible to this same deadlock; we fixed it by granting
GFP_NOFS allocations additional LRU isolation headroom, to ensure it makes
forward progress while holding fs locks that other reclaimers might
acquire.  Do the same here.

This code has been like this since compaction was initially merged, and I
only managed to trigger this with out-of-tree patches that dramatically
increase the contexts that do GFP_NOFS compaction.  While the issue is
real, it seems theoretical in nature given existing allocation sites. 
Worth fixing now, but no Fixes tag or stable CC.

Link: https://lkml.kernel.org/r/20230519111359.40475-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:37 -07:00
Johannes Weiner
3cf0493752 mm: compaction: have compaction_suitable() return bool
Since it only returns COMPACT_CONTINUE or COMPACT_SKIPPED now, a bool
return value simplifies the callsites.

Link: https://lkml.kernel.org/r/20230602151204.GD161817@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:37 -07:00
Johannes Weiner
1c9568e806 mm: compaction: drop redundant watermark check in compaction_zonelist_suitable()
The watermark check in compaction_zonelist_suitable(), called from
should_compact_retry(), is sandwiched between two watermark checks
already: before, there are freelist attempts as part of direct reclaim and
direct compaction; after, there is a last-minute freelist attempt in
__alloc_pages_may_oom().

The check in compaction_zonelist_suitable() isn't necessary. Kill it.

Link: https://lkml.kernel.org/r/20230519123959.77335-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:37 -07:00
Johannes Weiner
f98a497e1f mm: compaction: remove unnecessary is_via_compact_memory() checks
Remove from all paths not reachable via /proc/sys/vm/compact_memory.

Link: https://lkml.kernel.org/r/20230519123959.77335-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:36 -07:00
Johannes Weiner
e8606320e9 mm: compaction: refactor __compaction_suitable()
__compaction_suitable() is supposed to check for available migration
targets.  However, it also checks whether the operation was requested via
/proc/sys/vm/compact_memory, and whether the original allocation request
can already succeed.  These don't apply to all callsites.

Move the checks out to the callers, so that later patches can deal with
them one by one.  No functional change intended.

[hannes@cmpxchg.org: fix comment, per Vlastimil]
  Link: https://lkml.kernel.org/r/20230602144942.GC161817@cmpxchg.org
Link: https://lkml.kernel.org/r/20230519123959.77335-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:36 -07:00
Johannes Weiner
511a69b27f mm: compaction: simplify should_compact_retry()
The different branches for retry are unnecessarily complicated.  There are
really only three outcomes: progress (retry n times), skipped (retry if
reclaim can help), failed (retry with higher priority).

Rearrange the branches and the retry counter to make it simpler.

[hannes@cmpxchg.org: restore behavior when hitting max_retries]
  Link: https://lkml.kernel.org/r/20230602144705.GB161817@cmpxchg.org
Link: https://lkml.kernel.org/r/20230519123959.77335-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:36 -07:00
Johannes Weiner
ecd8b2928f mm: compaction: remove compaction result helpers
Patch series "mm: compaction: cleanups & simplifications".

These compaction cleanups are split out from the huge page allocator
series[1], as requested by reviewer feedback.

[1] https://lore.kernel.org/linux-mm/20230418191313.268131-1-hannes@cmpxchg.org/


This patch (of 5):

The compaction result helpers encode quirks that are specific to the
allocator's retry logic.  E.g.  COMPACT_SUCCESS and COMPACT_COMPLETE
actually represent failures that should be retried upon, and so on.  I
frequently found myself pulling up the helper implementation in order to
understand and work on the retry logic.  They're not quite clean
abstractions; rather they split the retry logic into two locations.

Remove the helpers and inline the checks.  Then comment on the result
interpretations directly where the decision making happens.

Link: https://lkml.kernel.org/r/20230519123959.77335-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20230519123959.77335-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:36 -07:00
Tom Rix
62069aace1 mm: page_alloc: set sysctl_lowmem_reserve_ratio storage-class-specifier to static
smatch reports
mm/page_alloc.c:247:5: warning: symbol
  'sysctl_lowmem_reserve_ratio' was not declared. Should it be static?

This variable is only used in its defining file, so it should be static

Link: https://lkml.kernel.org/r/20230518141119.927074-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:35 -07:00
Liam R. Howlett
5c1c03de1b mm: avoid rewalk in mmap_region
If the iterator has moved to the previous entry, then step forward one
range, back to the gap.

Link: https://lkml.kernel.org/r/20230518145544.1722059-36-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:35 -07:00
Liam R. Howlett
15c0c60b8c mm/mmap: change do_vmi_align_munmap() for maple tree iterator changes
The maple tree iterator clean up is incompatible with the way
do_vmi_align_munmap() expects it to behave.  Update the expected behaviour
to map now since the change will work currently.

Link: https://lkml.kernel.org/r/20230518145544.1722059-23-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:32 -07:00
Liam R. Howlett
36bd931049 mm: update vma_iter_store() to use MAS_WARN_ON()
MAS_WARN_ON() will provide more information on the maple state and can be
more useful for debugging.  Use this version of WARN_ON() in the debugging
code when storing to the tree.

Update the printk to a pr_warn(), but this will only be printed when maple
tree debug is enabled anyways.

Making all print statements into one will keep them together on a busy
terminal.

Link: https://lkml.kernel.org/r/20230518145544.1722059-19-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:31 -07:00
Liam R. Howlett
b50e195ff4 mm: update validate_mm() to use vma iterator
Use the vma iterator in the validation code and combine the code to check
the maple tree into the main validate_mm() function.

Introduce a new function vma_iter_dump_tree() to dump the maple tree in
hex layout.

Replace all calls to validate_mm_mt() with validate_mm().

[Liam.Howlett@oracle.com: update validate_mm() to use vma iterator CONFIG flag]
  Link: https://lkml.kernel.org/r/20230606183538.588190-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20230518145544.1722059-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:31 -07:00
Liam R. Howlett
89f499f35c maple_tree: add format option to mt_dump()
Allow different formatting strings to be used when dumping the tree. 
Currently supports hex and decimal.

Link: https://lkml.kernel.org/r/20230518145544.1722059-6-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:28 -07:00
Matthew Wilcox (Oracle)
4e096ae180 mm: convert migrate_pages() to work on folios
Almost all of the callers & implementors of migrate_pages() were already
converted to use folios.  compaction_alloc() & compaction_free() are
trivial to convert a part of this patch and not worth splitting out.

Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:27 -07:00
Lorenzo Stoakes
b2cac24819 mm/gup: remove vmas array from internal GUP functions
Now we have eliminated all callers to GUP APIs which use the vmas
parameter, eliminate it altogether.

This eliminates a class of bugs where vmas might have been kept around
longer than the mmap_lock and thus we need not be concerned about locks
being dropped during this operation leaving behind dangling pointers.

This simplifies the GUP API and makes it considerably clearer as to its
purpose - follow flags are applied and if pinning, an array of pages is
returned.

Link: https://lkml.kernel.org/r/6811b4b2b4b3baf3dd07f422bb18853bb2cd09fb.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:26 -07:00
Lorenzo Stoakes
4c630f3074 mm/gup: remove vmas parameter from pin_user_pages()
We are now in a position where no caller of pin_user_pages() requires the
vmas parameter at all, so eliminate this parameter from the function and
all callers.

This clears the way to removing the vmas parameter from GUP altogether.

Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>	[qib]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>	[drivers/media]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:26 -07:00
Lorenzo Stoakes
ca5e863233 mm/gup: remove vmas parameter from get_user_pages_remote()
The only instances of get_user_pages_remote() invocations which used the
vmas parameter were for a single page which can instead simply look up the
VMA directly. In particular:-

- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
  remove it.

- __access_remote_vm() was already using vma_lookup() when the original
  lookup failed so by doing the lookup directly this also de-duplicates the
  code.

We are able to perform these VMA operations as we already hold the
mmap_lock in order to be able to call get_user_pages_remote().

As part of this work we add get_user_page_vma_remote() which abstracts the
VMA lookup, error handling and decrementing the page reference count should
the VMA lookup fail.

This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.

[akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:26 -07:00
Lorenzo Stoakes
0b295316b3 mm/gup: remove unused vmas parameter from pin_user_pages_remote()
No invocation of pin_user_pages_remote() uses the vmas parameter, so
remove it.  This forms part of a larger patch set eliminating the use of
the vmas parameters altogether.

Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:25 -07:00
Lorenzo Stoakes
54d020692b mm/gup: remove unused vmas parameter from get_user_pages()
Patch series "remove the vmas parameter from GUP APIs", v6.

(pin_/get)_user_pages[_remote]() each provide an optional output parameter
for an array of VMA objects associated with each page in the input range.

These provide the means for VMAs to be returned, as long as mm->mmap_lock
is never released during the GUP operation (i.e.  the internal flag
FOLL_UNLOCKABLE is not specified).

In addition, these VMAs can only be accessed with the mmap_lock held and
become invalidated the moment it is released.

The vast majority of invocations do not use this functionality and of
those that do, all but one case retrieve a single VMA to perform checks
upon.

It is not egregious in the single VMA cases to simply replace the
operation with a vma_lookup().  In these cases we duplicate the (fast)
lookup on a slow path already under the mmap_lock, abstracted to a new
get_user_page_vma_remote() inline helper function which also performs
error checking and reference count maintenance.

The special case is io_uring, where io_pin_pages() specifically needs to
assert that the VMAs underlying the range do not result in broken
long-term GUP file-backed mappings.

As GUP now internally asserts that FOLL_LONGTERM mappings are not
file-backed in a broken fashion (i.e.  requiring dirty tracking) - as
implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to
file-backed mappings" - this logic is no longer required and so we can
simply remove it altogether from io_uring.

Eliminating the vmas parameter eliminates an entire class of danging
pointer errors that might have occured should the lock have been
incorrectly released.

In addition, the API is simplified and now clearly expresses what it is
intended for - applying the specified GUP flags and (if pinning) returning
pinned pages.

This change additionally opens the door to further potential improvements
in GUP and the possible marrying of disparate code paths.

I have run this series against gup_test with no issues.

Thanks to Matthew Wilcox for suggesting this refactoring!


This patch (of 6):

No invocation of get_user_pages() use the vmas parameter, so remove it.

The GUP API is confusing and caveated.  Recent changes have done much to
improve that, however there is more we can do.  Exporting vmas is a prime
target as the caller has to be extremely careful to preclude their use
after the mmap_lock has expired or otherwise be left with dangling
pointers.

Removing the vmas parameter focuses the GUP functions upon their primary
purpose - pinning (and outputting) pages as well as performing the actions
implied by the input flags.

This is part of a patch series aiming to remove the vmas parameter
altogether.

Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts)
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sean Christopherson <seanjc@google.com> (KVM)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:25 -07:00
Kefeng Wang
ecbb490d8e mm: page_alloc: move is_check_pages_enabled() into page_alloc.c
The is_check_pages_enabled() only used in page_alloc.c, move it into
page_alloc.c, also use it in free_tail_page_prepare().

Link: https://lkml.kernel.org/r/20230516063821.121844-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:25 -07:00
Kefeng Wang
e95d372c4c mm: page_alloc: move sysctls into it own fils
This moves all page alloc related sysctls to its own file, as part of the
kernel/sysctl.c spring cleaning, also move some functions declarations
from mm.h into internal.h.

Link: https://lkml.kernel.org/r/20230516063821.121844-13-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:24 -07:00
Kefeng Wang
5221b5a893 mm: vmscan: use gfp_has_io_fs()
Use gfp_has_io_fs() instead of open-code.

Link: https://lkml.kernel.org/r/20230516063821.121844-12-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:24 -07:00
Kefeng Wang
07f44ac3c9 mm: page_alloc: move pm_* function into power
pm_restrict_gfp_mask()/pm_restore_gfp_mask() only used in power, let's
move them out of page_alloc.c.

Adding a general gfp_has_io_fs() function which return true if gfp with
both __GFP_IO and __GFP_FS flags, then use it inside of
pm_suspended_storage(), also the pm_suspended_storage() is moved into
suspend.h.

Link: https://lkml.kernel.org/r/20230516063821.121844-11-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:24 -07:00