Sven reports an infinite loop in __alloc_pages_slowpath() for costly order
__GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination
can happen in a suspend/resume context where a GFP_KERNEL allocation can
have __GFP_IO masked out via gfp_allowed_mask.
Quoting Sven:
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath tries to get a page from the
freelist. This fails because there is nothing free of that costly
order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
which bails out because a zone is ready to be compacted; it pretends
to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early because
__GFP_IO is not set (it's not passed by the snd allocator, and even
if it were, we are suspending so the __GFP_IO flag would be cleared
anyway).
5. page alloc believes reclaim progress was made (because of the
pretense in item 3) and so it checks whether it should retry
compaction. The compaction retry logic thinks it should try again,
because:
a) reclaim is needed because of the early bail-out in item 4
b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
(end quote)
The immediate root cause is confusing the COMPACT_SKIPPED returned from
__alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
indicating a lack of order-0 pages, and in step 5 evaluating that in
should_compact_retry() as a reason to retry, before incrementing and
limiting the number of retries. There are however other places that
wrongly assume that compaction can happen while we lack __GFP_IO.
To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
evaluation and switch the open-coded test in try_to_compact_pages() to use
it.
Also use the new helper in:
- compaction_ready(), which will make reclaim not bail out in step 3, so
there's at least one attempt to actually reclaim, even if chances are
small for a costly order
- in_reclaim_compaction() which will make should_continue_reclaim()
return false and we don't over-reclaim unnecessarily
- in __alloc_pages_slowpath() to set a local variable can_compact,
which is then used to avoid retrying reclaim/compaction for costly
allocations (step 5) if we can't compact and also to skip the early
compaction attempt that we do in some cases
Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz
Fixes: 3250845d05 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sven van Ashbrook <svenva@chromium.org>
Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/
Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Curtis Malainey <cujomalainey@chromium.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
are included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the
series
"maple_tree: add mt_free_one() and mt_attr() helpers"
"Some cleanups of maple tree"
- In the series "mm: use memmap_on_memory semantics for dax/kmem"
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few
fixes) in the patch series
"Add folio_zero_tail() and folio_fill_tail()"
"Make folio_start_writeback return void"
"Fix fault handler's handling of poisoned tail pages"
"Convert aops->error_remove_page to ->error_remove_folio"
"Finish two folio conversions"
"More swap folio conversions"
- Kefeng Wang has also contributed folio-related work in the series
"mm: cleanup and use more folio in page fault"
- Jim Cromie has improved the kmemleak reporting output in the
series "tweak kmemleak report format".
- In the series "stackdepot: allow evicting stack traces" Andrey
Konovalov to permits clients (in this case KASAN) to cause
eviction of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series "mm:
page_alloc: fixes for high atomic reserve caluculations".
- Dmitry Rokosov has added to the samples/ dorectory some sample
code for a userspace memcg event listener application. See the
series "samples: introduce cgroup events listeners".
- Some mapletree maintanance work from Liam Howlett in the series
"maple_tree: iterator state changes".
- Nhat Pham has improved zswap's approach to writeback in the
series "workload-specific and memory pressure-driven zswap
writeback".
- DAMON/DAMOS feature and maintenance work from SeongJae Park in
the series
"mm/damon: let users feed and tame/auto-tune DAMOS"
"selftests/damon: add Python-written DAMON functionality tests"
"mm/damon: misc updates for 6.8"
- Yosry Ahmed has improved memcg's stats flushing in the series
"mm: memcg: subtree stats flushing and thresholds".
- In the series "Multi-size THP for anonymous memory" Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series "More buffer_head
cleanups".
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
"userfaultfd move option". UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a "KSM Advisor", in the series
"mm/ksm: Add ksm advisor". This is a governor which tunes KSM's
scanning aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory
use in the series "mm/zswap: dstmem reuse optimizations and
cleanups".
- Matthew Wilcox has performed some maintenance work on the
writeback code, both code and within filesystems. The series is
"Clean up the writeback paths".
- Andrey Konovalov has optimized KASAN's handling of alloc and
free stack traces for secondary-level allocators, in the series
"kasan: save mempool stack traces".
- Andrey also performed some KASAN maintenance work in the series
"kasan: assorted clean-ups".
- David Hildenbrand has gone to town on the rmap code. Cleanups,
more pte batching, folio conversions and more. See the series
"mm/rmap: interface overhaul".
- Kinsey Ho has contributed some maintenance work on the MGLRU
code in the series "mm/mglru: Kconfig cleanup".
- Matthew Wilcox has contributed lruvec page accounting code
cleanups in the series "Remove some lruvec page accounting
functions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
=0NHs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the series
'maple_tree: add mt_free_one() and mt_attr() helpers'
'Some cleanups of maple tree'
- In the series 'mm: use memmap_on_memory semantics for dax/kmem'
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few fixes)
in the patch series
'Add folio_zero_tail() and folio_fill_tail()'
'Make folio_start_writeback return void'
'Fix fault handler's handling of poisoned tail pages'
'Convert aops->error_remove_page to ->error_remove_folio'
'Finish two folio conversions'
'More swap folio conversions'
- Kefeng Wang has also contributed folio-related work in the series
'mm: cleanup and use more folio in page fault'
- Jim Cromie has improved the kmemleak reporting output in the series
'tweak kmemleak report format'.
- In the series 'stackdepot: allow evicting stack traces' Andrey
Konovalov to permits clients (in this case KASAN) to cause eviction
of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series 'mm:
page_alloc: fixes for high atomic reserve caluculations'.
- Dmitry Rokosov has added to the samples/ dorectory some sample code
for a userspace memcg event listener application. See the series
'samples: introduce cgroup events listeners'.
- Some mapletree maintanance work from Liam Howlett in the series
'maple_tree: iterator state changes'.
- Nhat Pham has improved zswap's approach to writeback in the series
'workload-specific and memory pressure-driven zswap writeback'.
- DAMON/DAMOS feature and maintenance work from SeongJae Park in the
series
'mm/damon: let users feed and tame/auto-tune DAMOS'
'selftests/damon: add Python-written DAMON functionality tests'
'mm/damon: misc updates for 6.8'
- Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
memcg: subtree stats flushing and thresholds'.
- In the series 'Multi-size THP for anonymous memory' Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series 'More buffer_head
cleanups'.
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
Add ksm advisor'. This is a governor which tunes KSM's scanning
aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory use
in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.
- Matthew Wilcox has performed some maintenance work on the writeback
code, both code and within filesystems. The series is 'Clean up the
writeback paths'.
- Andrey Konovalov has optimized KASAN's handling of alloc and free
stack traces for secondary-level allocators, in the series 'kasan:
save mempool stack traces'.
- Andrey also performed some KASAN maintenance work in the series
'kasan: assorted clean-ups'.
- David Hildenbrand has gone to town on the rmap code. Cleanups, more
pte batching, folio conversions and more. See the series 'mm/rmap:
interface overhaul'.
- Kinsey Ho has contributed some maintenance work on the MGLRU code
in the series 'mm/mglru: Kconfig cleanup'.
- Matthew Wilcox has contributed lruvec page accounting code cleanups
in the series 'Remove some lruvec page accounting functions'"
* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
mm, treewide: introduce NR_PAGE_ORDERS
selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
selftests/mm: skip test if application doesn't has root privileges
selftests/mm: conform test to TAP format output
selftests: mm: hugepage-mmap: conform to TAP format output
selftests/mm: gup_test: conform test to TAP format output
mm/selftests: hugepage-mremap: conform test to TAP format output
mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
mm/memcontrol: remove __mod_lruvec_page_state()
mm/khugepaged: use a folio more in collapse_file()
slub: use a folio in __kmalloc_large_node
slub: use folio APIs in free_large_kmalloc()
slub: use alloc_pages_node() in alloc_slab_page()
mm: remove inc/dec lruvec page state functions
mm: ratelimit stat flush from workingset shrinker
kasan: stop leaking stack trace handles
mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
mm/mglru: add dummy pmd_dirty()
...
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Demotion can work well without CONFIG_NUMA_BALANCING. But the commit
23e9f01389 ("mm/vmstat: move pgdemote_* to per-node stats") wrongly hid
it behind CONFIG_NUMA_BALANCING.
Fix it by moving them out of CONFIG_NUMA_BALANCING.
Link: https://lkml.kernel.org/r/20231229022651.3229174-1-lizhijian@fujitsu.com
Fixes: 23e9f01389 ("mm/vmstat: move pgdemote_* to per-node stats")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Improve code readability by removing CONFIG_TRANSPARENT_HUGEPAGE,
since the compiler should be able to automatically optimize out the
code that promotes THPs during page table walks.
No functional changes.
Link: https://lkml.kernel.org/r/20231227141205.2200125-6-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.vnet.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove CONFIG_MEMCG in a refactoring to improve code readability at
the cost of a few bytes in struct lru_gen_folio per node when
CONFIG_MEMCG=n.
Link: https://lkml.kernel.org/r/20231227141205.2200125-4-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.vnet.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add CONFIG_LRU_GEN_WALKS_MMU such that if disabled, the code that
walks page tables to promote pages into the youngest generation will
not be built.
Also improves code readability by adding two helper functions
get_mm_state() and get_next_mm().
Link: https://lkml.kernel.org/r/20231227141205.2200125-3-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.vnet.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stats flushing for memcg currently follows the following rules:
- Always flush the entire memcg hierarchy (i.e. flush the root).
- Only one flusher is allowed at a time. If someone else tries to flush
concurrently, they skip and return immediately.
- A periodic flusher flushes all the stats every 2 seconds.
The reason this approach is followed is because all flushes are serialized
by a global rstat spinlock. On the memcg side, flushing is invoked from
userspace reads as well as in-kernel flushers (e.g. reclaim, refault,
etc). This approach aims to avoid serializing all flushers on the global
lock, which can cause a significant performance hit under high
concurrency.
This approach has the following problems:
- Occasionally a userspace read of the stats of a non-root cgroup will
be too expensive as it has to flush the entire hierarchy [1].
- Sometimes the stats accuracy are compromised if there is an ongoing
flush, and we skip and return before the subtree of interest is
actually flushed, yielding stale stats (by up to 2s due to periodic
flushing). This is more visible when reading stats from userspace,
but can also affect in-kernel flushers.
The latter problem is particulary a concern when userspace reads stats
after an event occurs, but gets stats from before the event. Examples:
- When memory usage / pressure spikes, a userspace OOM handler may look
at the stats of different memcgs to select a victim based on various
heuristics (e.g. how much private memory will be freed by killing
this). Reading stale stats from before the usage spike in this case
may cause a wrongful OOM kill.
- A proactive reclaimer may read the stats after writing to
memory.reclaim to measure the success of the reclaim operation. Stale
stats from before reclaim may give a false negative.
- Reading the stats of a parent and a child memcg may be inconsistent
(child larger than parent), if the flush doesn't happen when the
parent is read, but happens when the child is read.
As for in-kernel flushers, they will occasionally get stale stats. No
regressions are currently known from this, but if there are regressions,
they would be very difficult to debug and link to the source of the
problem.
This patch aims to fix these problems by restoring subtree flushing, and
removing the unified/coalesced flushing logic that skips flushing if there
is an ongoing flush. This change would introduce a significant regression
with global stats flushing thresholds. With per-memcg stats flushing
thresholds, this seems to perform really well. The thresholds protect the
underlying lock from unnecessary contention.
This patch was tested in two ways to ensure the latency of flushing is
up to par, on a machine with 384 cpus:
- A synthetic test with 5000 concurrent workers in 500 cgroups doing
allocations and reclaim, as well as 1000 readers for memory.stat
(variation of [2]). No regressions were noticed in the total runtime.
Note that significant regressions in this test are observed with
global stats thresholds, but not with per-memcg thresholds.
- A synthetic stress test for concurrently reading memcg stats while
memory allocation/freeing workers are running in the background,
provided by Wei Xu [3]. With 250k threads reading the stats every
100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01%
of reads take more than 1ms, and no reads take more than 100ms.
[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
[akpm@linux-foundation.org: fix mm/zswap.c]
[yosryahmed@google.com: remove stats flushing mutex]
Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the effort to reduce zombie memcgs [1], it was discovered that the
memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically,
instead of rotating them to the tail of the current generation
(MEMCG_LRU_TAIL) for a second attempt, it moves them to the next
generation (MEMCG_LRU_YOUNG) after the first attempt.
Not applying enough pressure on offlined memcgs can cause them to build
up, and this can be particularly harmful to memory-constrained systems.
On Pixel 8 Pro, launching apps for 50 cycles:
Before After Change
Zombie memcgs 45 35 -22%
[1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com
Fixes: e4dde56cd2 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: T.J. Mercier <tjmercier@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru:
try to stop at high watermarks"), it was discovered that the memcg LRU can
breach the thrashing protection imposed by min_ttl_ms.
Before the memcg LRU:
kswapd()
shrink_node_memcgs()
mem_cgroup_iter()
inc_max_seq() // always hit a different memcg
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
After the memcg LRU:
kswapd()
shrink_many()
restart:
iterate the memcg LRU:
inc_max_seq() // occasionally hit the same memcg
if raced with lru_gen_rotate_memcg():
goto restart
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
Specifically, when the restart happens in shrink_many(), it needs to stick
with the (memcg LRU) generation it began with. In other words, it should
neither re-read memcg_lru->seq nor age an lruvec of a different
generation. Otherwise it can hit the same memcg multiple times without
giving lru_gen_age_node() a chance to check the timestamp of that memcg's
oldest generation (against min_ttl_ms).
[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com
Fixes: e4dde56cd2 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The initial MGLRU patchset didn't include the memcg LRU support, and it
relied on should_abort_scan(), added by commit f76c833788 ("mm:
multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
overshooting their aggregate reclaim target by too much".
Later on when the memcg LRU was added, should_abort_scan() was deemed
unnecessary, and the test results [1] showed no side effects after it was
removed by commit a579086c99 ("mm: multi-gen LRU: remove eviction
fairness safeguard").
However, that test used memory.reclaim, which sets nr_to_reclaim to
SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1 pages,
i.e., from nr_reclaimed=nr_to_reclaim-1 to
nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch
size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore
that test isn't able to reproduce the worst case scenario, i.e., kswapd
overshooting GBs on large systems and "consuming 100% CPU" (see the Closes
tag).
Bring back a simplified version of should_abort_scan() on top of the memcg
LRU, so that kswapd stops when all eligible zones are above their
respective high watermarks plus a small delta to lower the chance of
KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0
reclaim, meaning compaction-induced reclaim can still run wild (which is a
different problem).
On Android, launching 55 apps sequentially:
Before After Change
pgpgin 838377172 802955040 -4%
pgpgout 38037080 34336300 -10%
[1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com
Fixes: a579086c99 ("mm: multi-gen LRU: remove eviction fairness safeguard")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/
Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Unmapped folios accessed through file descriptors can be underprotected.
Those folios are added to the oldest generation based on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).
However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a surge
of refaults. (Note that the PID controller uses weighted refaults and
those from evicted generations only take a half of the whole weight.) In
other words, for a short lived generation, the moving average smooths out
the spike quickly.
To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.
On Android, launching 55 apps sequentially:
Before After Change
workingset_refault_anon 25641024 25598972 0%
workingset_refault_file 115016834 106178438 -8%
Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Demotion will migrate pages across nodes. Previously, only the global
demotion statistics were accounted for. Changed them to per-node
statistics, making it easier to observe where demotion occurs on each
node.
This will help to identify which nodes are under pressure.
This patch also make pgdemote_* behind CONFIG_NUMA_BALANCING, since
demotion is not available for !CONFIG_NUMA_BALANCING
With this patch, here is a sample where node0 node1 are DRAM,
node3 is PMEM:
Global stats:
$ grep demote /proc/vmstat
pgdemote_kswapd 254288
pgdemote_direct 113497
pgdemote_khugepaged 0
Per-node stats:
$ grep demote /sys/devices/system/node/node0/vmstat # demotion source
pgdemote_kswapd 68454
pgdemote_direct 83431
pgdemote_khugepaged 0
$ grep demote /sys/devices/system/node/node1/vmstat # demotion source
pgdemote_kswapd 185834
pgdemote_direct 30066
pgdemote_khugepaged 0
$ grep demote /sys/devices/system/node/node3/vmstat # demotion target
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0
Link: https://lkml.kernel.org/r/20231103031450.1456523-1-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current memory reclaim delay statistics only count the direct memory
reclaim of the task in do_try_to_free_pages(). In systems with NUMA open,
some tasks occasionally experience slower response times, but the total
count of reclaim does not increase, using ftrace can show that
node_reclaim has occurred.
The memory reclaim occurring in get_page_from_freelist() is also due to
heavy memory load. To get the impact of tasks in memory reclaim, this
patch adds the statistics of the memory reclaim delay statistics for
__node_reclaim().
Link: https://lkml.kernel.org/r/181C946095F0252B+7cc60eca-1abf-4502-aad3-ffd8ef89d910@ex.bilibili.com
Signed-off-by: Wen Yu Li <wenyuli@ex.bilibili.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: <wangyun@bilibili.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This isolate_mode_t flag is effectively unused since 89f6c88a6a ("mm:
__isolate_lru_page_prepare() in isolate_migratepages_block()") as
sc->may_unmap is now checked directly (and only node_reclaim has a mode
that sets it to 0). The last remaining place is mm_vmscan_lru_isolate
tracepoint for the isolate_mode parameter. That one was mainly used to
indicate the active/inactive mode, which the trace-vmscan-postprocess.pl
script consumed, but that got silently broken. After fixing the script by
the previous patch, it does not need the isolate_mode anymore. So just
remove the parameter and with that the whole ISOLATE_UNMAPPED flag.
Link: https://lkml.kernel.org/r/20230914131637.12204-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At present, we support per-memcg reclaim strategy, however we do not know
the number of transparent huge pages being reclaimed, as we know the
transparent huge pages need to be splited before reclaim them, and they
will bring some performance bottleneck effect. for example, when two
memcg (A & B) are doing reclaim for anonymous pages at same time, and 'A'
memcg is reclaiming a large number of transparent huge pages, we can
better analyze that the performance bottleneck will be caused by 'A'
memcg. therefore, in order to better analyze such problems, there add THP
swap out info for per-memcg.
[akpm@linux-foundation.orgL fix swap_writepage_fs(), per Johannes]
Link: https://lkml.kernel.org/r/20230913213343.GB48476@cmpxchg.org
Link: https://lkml.kernel.org/r/20230913164938.16918-1-vernhao@tencent.com
Signed-off-by: Xin Hao <vernhao@tencent.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When looking at the code in the memory part, I found that the purpose of
the function prepare_scan_countis very different from the function name.
It is easy to misunderstand when reading.The function prepare_scan_count
mainly completes the assignment of the scan_control structure.Therefore, I
suggest that the function name can be changed to prepare_scan_control,
which is easier to understand.
Link: https://lkml.kernel.org/r/20230912085923.27238-1-liujinlong@kylinos.cn
Signed-off-by: liujinlong <liujinlong@kylinos.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The mm/vmscan.c file is too large, so separate the shrinker-related code
from it into a separate file. No functional changes.
Link: https://lkml.kernel.org/r/20230911092517.64141-3-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If panic is enable,the err information will not be printed before bugon,
So swap it. Print the return value of PTR_ERR(pgdat->kswapd) also.
Link: https://lkml.kernel.org/r/20230906083700.181-1-angus.chen@jaguarmicro.com
Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In skip_cma(), we can use folio_migratetype() to replace
get_pageblock_migratetype().
Link: https://lkml.kernel.org/r/20230825075735.52436-1-user@VERNHAO-MC1
Signed-off-by: Vern Hao <vernhao@tencent.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's simply work on the folio directly and remove the helpers.
Link: https://lkml.kernel.org/r/20230821160849.531668-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Chris Li <chrisl@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch is based on the commit 5da226dbfce3("mm: skip CMA pages when
they are not available") which skips cma pages reclaim when they are not
eligible for the current allocation context. In mglru, such pages are
added to the tail of the immediate generation to maintain better LRU
order, which is unlike the case of conventional LRU where such pages are
directly added to the head of the LRU list(akin to adding to head of the
youngest generation in mglru).
No observable issue without this patch on MGLRU, but logically it make
sense to skip the CMA page reclaim when those pages can't be satisfied for
the current allocation context.
Link: https://lkml.kernel.org/r/1691568344-13475-1-git-send-email-quic_charante@quicinc.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
walk->can_swap might be invalid since it's not guaranteed to be
initialized for the particular lruvec. Instead deduce it from the folio
type (anon/file).
Link: https://lkml.kernel.org/r/20230802025606.346758-3-kaleshsingh@google.com
Fixes: 018ee47f14 ("mm: multi-gen LRU: exploit locality in rmap")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This
is because the generations are reused (the last oldest now empty
generation will become the next youngest generation).
inc_min_seq() is retried until successful, dropping the lru_lock
and yielding the CPU on each failure, and retaking the lock before
trying again:
while (!inc_min_seq(lruvec, type, can_swap)) {
spin_unlock_irq(&lruvec->lru_lock);
cond_resched();
spin_lock_irq(&lruvec->lru_lock);
}
However, the initial condition that required incrementing the min_seq
(nr_gens == MAX_NR_GENS) is not retested. This can change by another
call to inc_max_seq() from run_aging() with force_scan=true from the
debugfs interface.
Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid
unnecessarily incrementing the min_seq by rechecking the number of
generations before each attempt.
This issue was uncovered in previous discussion on the list by Yu Zhao
and Aneesh Kumar [1].
[1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com
Fixes: d6c3af7d8a ("mm: multi-gen LRU: debugfs interface")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MGLRU has a LRU list for each zone for each type (anon/file) in each
generation:
long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
The min_seq (oldest generation) can progress independently for each
type but the max_seq (youngest generation) is shared for both anon and
file. This is to maintain a common frame of reference.
In order for eviction to advance the min_seq of a type, all the per-zone
lists in the oldest generation of that type must be empty.
The eviction logic only considers pages from eligible zones for
eviction or promotion.
scan_folios() {
...
for (zone = sc->reclaim_idx; zone >= 0; zone--) {
...
sort_folio(); // Promote
...
isolate_folio(); // Evict
}
...
}
Consider the system has the movable zone configured and default 4
generations. The current state of the system is as shown below
(only illustrating one type for simplicity):
Type: ANON
Zone DMA32 Normal Movable Device
Gen 0 0 0 4GB 0
Gen 1 0 1GB 1MB 0
Gen 2 1MB 4GB 1MB 0
Gen 3 1MB 1MB 1MB 0
Now consider there is a GFP_KERNEL allocation request (eligible zone
index <= Normal), evict_folios() will return without doing any work
since there are no pages to scan in the eligible zones of the oldest
generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE
allocation request; which may not happen soon if there is a lot of free
memory in the movable zone. This can lead to OOM kills, although there
is 1GB pages in the Normal zone of Gen 1 that we have not yet tried to
reclaim.
This issue is not seen in the conventional active/inactive LRU since
there are no per-zone lists.
If there are no (not enough) folios to scan in the eligible zones, move
folios from ineligible zone (zone_index > reclaim_index) to the next
generation. This allows for the progression of min_seq and reclaiming
from the next generation (Gen 1).
Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently.
[1] https://github.com/raspberrypi/linux/issues/5395
Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a memcg is in the process of being released mem_cgroup_tryget will
fail because its reference count has already reached 0. This can happen
during reclaim if the memcg has already been offlined, and we reclaim all
remaining pages attributed to the offlined memcg. shrink_many attempts to
skip the empty memcg in this case, and continue reclaiming from the
remaining memcgs in the old generation. If there is only one memcg
remaining, or if all remaining memcgs are in the process of being released
then shrink_many will spin until all memcgs have finished being released.
The release occurs through a workqueue, so it can take a while before
kswapd is able to make any further progress.
This fix results in reductions in kswapd activity and direct reclaim in
a test where 28 apps (working set size > total memory) are repeatedly
launched in a random sequence:
A B delta ratio(%)
allocstall_movable 5962 3539 -2423 -40.64
allocstall_normal 2661 2417 -244 -9.17
kswapd_high_wmark_hit_quickly 53152 7594 -45558 -85.71
pageoutrun 57365 11750 -45615 -79.52
Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com
Fixes: e4dde56cd2 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
walk_page_range() and friends often operate under write-locked mmap_lock.
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas. Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.
The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks. With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them. The change ensures vmas are properly locked during such
walks.
A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration. Without this change a concurrent page
can be faulted into the area and be left out of migration.
Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.
This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache. The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.
The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.
The second patch is the actual fix. Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set. folio_needs_release() is altered to
add a check for this.
This patch (of 2):
Make filemap_release_folio() check folio_has_private(). Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.
There are a couple of sites in mm/vscan.c that this can't so easily be
done. In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.
In shrink_active_list(), we don't have have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.
A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory. This will acquire
additional parts to the condition in a future patch.
After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.
Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When memory.reclaim was introduced, it became the first case where
cgroup_reclaim() is true for the root cgroup. Johannes concluded [1] that
for most cases this is okay, except for one case. Historically, kswapd
would throttle reclaim on a node if a lot of pages marked for reclaim are
under writeback (aka the node is congested). This occurred by setting
LRUVEC_CONGESTED bit in lruvec->flags. The bit would be cleared when the
node is balanced.
Similarly, cgroup reclaim would set the same bit when an lruvec is
congested, and clear it on the way out of reclaim (to throttle local
reclaimers).
Before the introduction of memory.reclaim, the root memcg was the only
target of kswapd reclaim, and non-root memcgs were the only targets of
cgroup reclaim, so they would never interfere. Using the same bit for
both was fine. After memory.reclaim, it is possible for cgroup reclaim on
the root cgroup to clear the bit set by kswapd. This would result in
reclaim on the node to be unthrottled before the node is balanced.
Fix this by introducing separate bits for cgroup-level and node-level
congestion. kswapd can unthrottle an lruvec that is marked as congested
by cgroup reclaim (as the entire node should no longer be congested), but
not vice versa (to prevent premature unthrottling before the entire node
is balanced).
[1]https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Link: https://lkml.kernel.org/r/20230621023101.432780-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Evidently, global_reclaim() can be a confusing name. Especially that it
used to exist before with a subtly different definition (removed by commit
b5ead35e7e ("mm: vmscan: naming fixes: global_reclaim() and
sane_reclaim()"). It can be interpreted as non-cgroup reclaim, even
though it returns true for cgroup reclaim on the root memcg (through
memory.reclaim).
Rename it to root_reclaim() in an attempt to make it less ambiguous, and
add documentation to it as well as cgroup_reclaim.
Link: https://lkml.kernel.org/r/20230621023053.432374-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MGLRU's walk_pte_range() use the safer pte_offset_map_nolock(), rather
than pte_lockptr(), to get the ptl for its trylock. Just return false and
move on to next extent if it fails, like when the trylock fails. Remove
the VM_WARN_ON_ONCE(pmd_leaf) since that will happen, rarely.
Link: https://lkml.kernel.org/r/51ece73e-7398-2e4a-2384-56708c87844f@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add __meminit to kswapd_run() and kswapd_stop() to ensure they're default
to __init when memory hotplug is not enabled.
Link: https://lkml.kernel.org/r/20230606121813.242163-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit f95bdb700b.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink. So
revert the shrinker_srcu related changes first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-8-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit caa05325c9.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink. So
revert the shrinker_srcu related changes first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-7-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 475733dda5.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
We will try to use the refcount+RCU method [2] proposed by Dave Chinner to
continue to re-implement the lockless slab shrink. So revert the
shrinker_srcu related changes first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-6-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit b3cabea3c9.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be careful
to not frequently check for SRCU read-side critical section exits. Therefore,
even if no one is currently in the SRCU read-side critical section,
synchronize_srcu() cannot return quickly. That's why unregister_shrinker()
has become slower.
We will try to use the refcount+RCU method [2] proposed by Dave Chinner
to continue to re-implement the lockless slab shrink. Because there will
be other readers after reverting the shrinker_srcu related changes, so
it is better to restore to hold read lock to reparent shrinker nr_deferred.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-4-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 1643db98d9.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
We will try to use the refcount+RCU method [2] proposed by Dave Chinner to
continue to re-implement the lockless slab shrink. So we still need
shrinker_rwsem in synchronize_shrinkers() after reverting the
shrinker_srcu related changes.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-3-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "revert shrinker_srcu related changes".
This patch (of 7):
This reverts commit cf2e309ebc.
Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700b ("mm: vmscan: make
global slab shrink lockless"). The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly. That's why
unregister_shrinker() has become slower.
After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink. So
revert the shrinker_mutex back to shrinker_rwsem first.
[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230609081518.3039120-1-qi.zheng@linux.dev
Link: https://lkml.kernel.org/r/20230609081518.3039120-2-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A customer provided evidence indicating that a process
was stalled in direct reclaim:
- The process was trapped in throttle_direct_reclaim().
The function wait_event_killable() was called to wait condition
allow_direct_reclaim(pgdat) for current node to be true.
The allow_direct_reclaim(pgdat) examined the number of free pages
on the node by zone_page_state() which just returns value in
zone->vm_stat[NR_FREE_PAGES].
- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
However, the freelist on this node was not empty.
- This inconsistent of vmstat value was caused by percpu vmstat on
nohz_full cpus. Every increment/decrement of vmstat is performed
on percpu vmstat counter at first, then pooled diffs are cumulated
to the zone's vmstat counter in timely manner. However, on nohz_full
cpus (in case of this customer's system, 48 of 52 cpus) these pooled
diffs were not cumulated once the cpu had no event on it so that
the cpu started sleeping infinitely.
I checked percpu vmstat and found there were total 69 counts not
cumulated to the zone's vmstat counter yet.
- In this situation, kswapd did not help the trapped process.
In pgdat_balanced(), zone_wakermark_ok_safe() examined the number
of free pages on the node by zone_page_state_snapshot() which
checks pending counts on percpu vmstat.
Therefore kswapd could know there were 69 free pages correctly.
Since zone->_watermark = {8, 20, 32}, kswapd did not work because
69 was greater than 32 as high watermark.
Change allow_direct_reclaim to use zone_page_state_snapshot, which
allows a more precise version of the vmstat counters to be used.
allow_direct_reclaim will only be called from try_to_free_pages,
which is not a hot path.
Testing: Due to difficulties accessing the system, it has not been
possible for the reproducer to test the patch (however its
clear from available data and analysis that it should fix it).
Link: https://lkml.kernel.org/r/20230530145335.677325196@redhat.com
Reviewed-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On Android app cycle workloads, MGLRU showed a significant reduction in
workingset refaults although pgpgin/pswpin remained relatively unchanged.
This indicated MGLRU may be undercounting workingset refaults.
This has impact on userspace programs, like Android's LMKD, that monitor
workingset refault statistics to detect thrashing.
It was found that refaults were only accounted if the MGLRU shadow entry
was for a recently evicted folio. However, recently evicted folios should
be accounted as workingset activation, and refaults should be accounted
regardless of recency.
Fix MGLRU's workingset refault and activation accounting to more closely
match that of the conventional active/inactive LRU.
Link: https://lkml.kernel.org/r/20230523205922.3852731-1-kaleshsingh@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add helpers to page table walking code:
- Clarifies intent via name "should_walk_mmu" and "should_clear_pmd_young"
- Avoids repeating same logic in two places
Link: https://lkml.kernel.org/r/20230522112058.2965866-3-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>