Commit Graph

Kairui Song
38fdedd883 mm/swap_cgroup: remove swap_cgroup_cmpxchg
This function is never used after commit 6b611388b6 ("memcg-v1: remove
charge move code").

Link: https://lkml.kernel.org/r/20241218114633.85196-3-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:08 -08:00
Kairui Song
20214ad102 mm, memcontrol: avoid duplicated memcg enable check
Patch series "mm/swap_cgroup: remove global swap cgroup lock", v3.

This series removes the global swap cgroup lock.  The critical section of
this lock is very short, but it's still a bottleneck for massively
parallel swap workloads.

Up to 10% performance gain was observed for a tmpfs kernel-build test on a
48c96t system under memory pressure, with no regression in other cases.


This patch (of 3):

mem_cgroup_uncharge_swap() includes a mem_cgroup_disabled() check,
so the caller doesn't need to check that.
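
As an illustration of the pattern (a minimal userspace sketch with made-up
names, not the kernel code): the callee already bails out when the feature
is disabled, so the caller's duplicate guard can simply be dropped.

  #include <stdbool.h>
  #include <stdio.h>

  static bool memcg_disabled = true;

  static void uncharge_swap(int nr_pages)
  {
          if (memcg_disabled)     /* the check lives in the callee... */
                  return;
          printf("uncharging %d pages\n", nr_pages);
  }

  int main(void)
  {
          /* ...so no duplicate "if (!memcg_disabled)" at the call site */
          uncharge_swap(4);
          return 0;
  }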

Link: https://lkml.kernel.org/r/20241218114633.85196-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20241218114633.85196-2-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:08 -08:00
Liam R. Howlett
5c71d9a623 test_maple_tree: test exhausted upper limit of mtree_alloc_cyclic()
When the upper bound of the search is exhausted, the maple state may be
returned in an error state of -EBUSY.  This means the maple state needs to
be reset before the second search in mas_alloc_cyclic() to ensure the
search happens.  This test ensures the issue is not recreated.
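
A toy model of the cyclic-allocation wrap-around (plain C stand-in, not
the maple tree API): the first pass searches from the hint to the upper
limit, and the second pass must restart cleanly from the low bound, which
is exactly what the reset guards.

  #include <stdio.h>

  #define LO 0
  #define HI 3

  static int used[HI + 1];

  static int alloc_cyclic(int *next)
  {
          for (int pass = 0; pass < 2; pass++) {
                  /* 2nd pass: forget the exhausted search, restart at LO */
                  int start = pass ? LO : *next;

                  for (int i = start; i <= HI; i++) {
                          if (!used[i]) {
                                  used[i] = 1;
                                  *next = i + 1;
                                  return i;
                          }
                  }
          }
          return -1;      /* both ranges exhausted */
  }

  int main(void)
  {
          int next = 2;

          for (int i = 0; i < 5; i++)
                  printf("%d\n", alloc_cyclic(&next)); /* 2 3 0 1 -1 */
          return 0;
  }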

Link: https://lkml.kernel.org/r/20241216190113.1226145-3-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:08 -08:00
Thomas Weißschuh
9eb6b0801a mm/page_idle: constify 'struct bin_attribute'
The sysfs core now allows instances of 'struct bin_attribute' to be moved
into read-only memory.  Make use of that to protect them against
accidental or malicious modifications.
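
A userspace analogue of why const matters here: marking an ops-style
struct const places it in .rodata, so stray writes fail to compile (or
fault at run time) instead of silently corrupting the callbacks.

  #include <stdio.h>

  struct bin_ops {
          long (*read)(char *buf, long off, long count);
  };

  static long dummy_read(char *buf, long off, long count)
  {
          (void)buf; (void)off; (void)count;
          return 0;
  }

  /* const: the object is placed in read-only memory */
  static const struct bin_ops ops = { .read = dummy_read };

  int main(void)
  {
          /* ops.read = NULL; */  /* error: assignment of member 'read'
                                     in read-only object */
          return ops.read(NULL, 0, 0) == 0 ? 0 : 1;
  }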

Link: https://lkml.kernel.org/r/20241216-sysfs-const-bin_attr-page_idle-v1-1-cc01ecc55196@weissschuh.net
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:07 -08:00
Andrew Morton
23b8261fa8 mm/huge_memory.c: rename shadowed local
split_huge_pages_write() has a local `buf' which shadows incoming arg
`buf'.  Reviewer confusion resulted.  Rename the inner local to `tok_buf'.
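
A compressed illustration of the hazard (hypothetical code, not the
kernel's): the inner buffer used to shadow the parameter, leaving readers
to guess which `buf' each line meant.

  #include <stdio.h>
  #include <string.h>

  static void write_handler(const char *buf)
  {
          char tok_buf[16];       /* was `buf`, shadowing the parameter */

          strncpy(tok_buf, buf, sizeof(tok_buf) - 1);
          tok_buf[sizeof(tok_buf) - 1] = '\0';
          printf("%s\n", tok_buf); /* no ambiguity about which buffer */
  }

  int main(void)
  {
          write_handler("512,7");
          return 0;
  }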

Cc: Leo Stone <leocstone@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:07 -08:00
Lorenzo Stoakes
b931c5329e tools: testing: add simple __mmap_region() userland test
Introduce a demonstrative, basic __mmap_region() test upon which we can
base further work moving forwards.

This simply asserts that mappings can be made and merges occur as
expected.

As part of this change, fix the security_vm_enough_memory_mm() stub which
was previously incorrectly implemented.

Link: https://lkml.kernel.org/r/20241213162409.41498-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:07 -08:00
Joanne Koong
aca5246a78 fuse: remove tmp folio for writebacks and internal rb tree
In the current FUSE writeback design (see commit 3be5a52b30 ("fuse:
support writable mmap")), a temp page is allocated for every dirty page to
be written back, the contents of the dirty page are copied over to the
temp page, and the temp page gets handed to the server to write back.

This is done so that writeback may be immediately cleared on the dirty page,
and this in turn is done for two reasons:
a) in order to mitigate the following deadlock scenario that may arise
if reclaim waits on writeback on the dirty page to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim
b) in order to unblock internal (eg sync, page compaction) waits on
writeback without needing the server to complete writing back to disk,
which may take an indeterminate amount of time.

With a recent change that adds AS_WRITEBACK_INDETERMINATE and mitigates
the situations described above, FUSE writeback does not need to use temp
pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.

This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings and
removes the temporary pages + extra copying and the internal rb tree.

fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)

Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
 ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount

fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount

  bs             1k           4k           1M
  Before  351 MiB/s   1818 MiB/s   1851 MiB/s
  After   341 MiB/s   2246 MiB/s   2685 MiB/s
  % diff        -3%         +23%         +45%

Link: https://lkml.kernel.org/r/20241122232359.429647-6-joannelkoong@gmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:07 -08:00
Joanne Koong
2cdf65eaaa mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
For migrations called in MIGRATE_SYNC mode, skip migrating the folio if it
is under writeback and has the AS_WRITEBACK_INDETERMINATE flag set on its
mapping.  If the AS_WRITEBACK_INDETERMINATE flag is set on the mapping,
the writeback may take an indeterminate amount of time to complete, and
waits may get stuck.

Link: https://lkml.kernel.org/r/20241122232359.429647-5-joannelkoong@gmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:06 -08:00
Joanne Koong
0da1c1a78d fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings
For filesystems with the AS_WRITEBACK_INDETERMINATE flag set, writeback
operations may take an indeterminate time to complete.  For example,
writing data back to disk in FUSE filesystems depends on the userspace
server successfully completing writeback.

In this commit, wait_sb_inodes() skips waiting on writeback if the inode's
mapping has AS_WRITEBACK_INDETERMINATE set, else sync(2) may take an
indeterminate amount of time to complete.

If the caller wishes to ensure the data for a mapping with the
AS_WRITEBACK_INDETERMINATE flag set has actually been written back to
disk, they should use fsync(2)/fdatasync(2) instead.
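
A hedged sketch of the skip inside the wait_sb_inodes() loop; the
open-coded flag test below stands in for whatever helper the series
actually adds around AS_* bits in mapping->flags:

  /* sketch only: skip inodes whose writeback may never complete */
  if (test_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags))
          continue;       /* sync(2) must not block indefinitely */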

Link: https://lkml.kernel.org/r/20241122232359.429647-4-joannelkoong@gmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:06 -08:00
Joanne Koong
26942ed372 mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts
Currently in shrink_folio_list(), reclaim for folios under writeback falls
into 3 different cases:

1) Reclaim is encountering an excessive number of folios under
   writeback and this folio has both the writeback and reclaim flags
   set
2) Dirty throttling is enabled (this happens if reclaim through cgroup
   is not enabled, if reclaim through cgroupv2 memcg is enabled, or
   if reclaim is on the root cgroup), or if the folio is not marked for
   immediate reclaim, or if the caller does not have __GFP_FS (or
   __GFP_IO if it's going to swap) set
3) Legacy cgroupv1 encounters a folio that already has the reclaim flag
   set and the caller did not have __GFP_FS (or __GFP_IO if swap) set

In cases 1) and 2), we activate the folio and skip reclaiming it while in
case 3), we wait for writeback to finish on the folio and then try to
reclaim the folio again.  In case 3, we wait on writeback because cgroupv1
does not have dirty folio throttling, as such this is a mitigation against
the case where there are too many folios in writeback with nothing else to
reclaim.

For filesystems where writeback may take an indeterminate amount of time
to write to disk, this has the possibility of stalling reclaim.

In this commit, if legacy memcg encounters a folio with the reclaim flag
set (eg case 3) and the folio belongs to a mapping that has the
AS_WRITEBACK_INDETERMINATE flag set, the folio will be activated and will
skip reclaim (eg default to the behavior in case 2) instead.

Link: https://lkml.kernel.org/r/20241122232359.429647-3-joannelkoong@gmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:06 -08:00
Joanne Koong
f43af7765e mm: add AS_WRITEBACK_INDETERMINATE mapping flag
Patch series "fuse: remove temp page copies in writeback", v6.

The purpose of this patchset is to help make writeback-cache write
performance in FUSE filesystems as fast as possible.

In the current FUSE writeback design (see commit 3be5a52b30 ("fuse:
support writable mmap")), a temp page is allocated for every dirty page
to be written back, the contents of the dirty page are copied over to the
temp page, and the temp page gets handed to the server to write back.
This is done so that writeback may be immediately cleared on the dirty
page, and this in turn is done for two reasons:

a) in order to mitigate the following deadlock scenario that may arise
   if reclaim waits on writeback on the dirty page to complete (more
   details can be found in this thread [1]):

   * single-threaded FUSE server is in the middle of handling a request
     that needs a memory allocation
   * memory allocation triggers direct reclaim
   * direct reclaim waits on a folio under writeback
   * the FUSE server can't write back the folio since it's stuck in
     direct reclaim

b) in order to unblock internal (eg sync, page compaction) waits on
   writeback without needing the server to complete writing back to disk,
   which may take an indeterminate amount of time.

Allocating and copying dirty pages to temp pages is the biggest
performance bottleneck for FUSE writeback.  This patchset aims to get rid
of the temp page altogether (which will also allow us to get rid of the
internal FUSE rb tree that is needed to keep track of writeback status on
the temp pages).  Benchmarks show approximately a 20% improvement in
throughput for 4k block-size writes and a 45% improvement for 1M
block-size writes.

With the temp page removed, writeback state is now only cleared on the
dirty page after the server has written it back to disk.  This may take
an indeterminate amount of time.  As well, there is also the possibility
of malicious or well-intentioned but buggy servers where writeback may,
in the worst-case scenario, never complete.  This means that any
folio_wait_writeback() on a dirty page belonging to a FUSE filesystem
needs to be carefully audited.

In particular, these are the cases that need to be accounted for:
* potentially deadlocking in reclaim, as mentioned above
* potentially stalling sync(2)
* potentially stalling page migration / compaction

This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
filesystems may set on its inode mappings to indicate that writeback
operations may take an indeterminate amount of time to complete.  FUSE
will set this flag on its mappings.  This patchset adds checks to the
critical parts of reclaim, sync, and page migration logic where writeback
may be waited on.

Please note the following:
* For sync(2), waiting on writeback will be skipped for FUSE, but this has no
  effect on existing behavior. Dirty FUSE pages are already not guaranteed to
  be written to disk by the time sync(2) returns (eg writeback is cleared on
  the dirty page but the server may not have written out the temp page to disk
  yet). If the caller wishes to ensure the data has actually been synced to
  disk, they should use fsync(2)/fdatasync(2) instead.
* AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
  waited on when in writeback. There are some cases where the wait is
  desirable. For example, for the sync_file_range() syscall, it is fine to
  wait on the writeback since the caller passes in a fd for the operation.

[1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/


This patch (of 5):

Add a new mapping flag AS_WRITEBACK_INDETERMINATE which filesystems may
set to indicate that writing back to disk may take an indeterminate amount
of time to complete.  Extra caution should be taken when waiting on
writeback for folios belonging to mappings where this flag is set.
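
A sketch of what the new flag and its setter could look like, following
the AS_* conventions in include/linux/pagemap.h; the bit position and the
helper name here are assumptions, not quotes from the patch:

  enum mapping_flags {
          /* ...existing AS_* flags... */
          AS_WRITEBACK_INDETERMINATE, /* writeback may never complete */
  };

  static inline void
  mapping_set_writeback_indeterminate(struct address_space *mapping)
  {
          set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
  }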

Link: https://lkml.kernel.org/r/20241122232359.429647-1-joannelkoong@gmail.com
Link: https://lkml.kernel.org/r/20241122232359.429647-2-joannelkoong@gmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:06 -08:00
Christoph Hellwig
969a9f8a78 mm: unexport apply_to_existing_page_range
apply_to_existing_page_range() is only used by non-modular code.

Link: https://lkml.kernel.org/r/20241212073423.1439954-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:05 -08:00
Andrew Morton
a6d825ca13 mm-fix-outdated-incorrect-code-comments-for-handle_mm_fault-fix
s/mmap_Lock/mmap_lock/, per Liam

Cc: Jinliang Zheng <alexjlzheng@gmail.com>
Cc: Jinliang Zheng <alexjlzheng@tencent.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:05 -08:00
Jinliang Zheng
e450da5cb2 mm: fix outdated incorrect code comments for handle_mm_fault()
Link: https://lkml.kernel.org/r/20241213031820.778342-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:05 -08:00
Uros Bizjak
cae376022e percpu/x86: enable strict percpu checks via named AS qualifiers
This patch declares percpu variables in the __seg_gs/__seg_fs named AS
and keeps them named-AS qualified until they are dereferenced with a
percpu accessor.  This approach enables various compiler checks for
cross-namespace variable assignments.
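
A minimal x86-64-only illustration of the class of check this enables
(compile-time only; requires a GCC/Clang with named address space
support, and nothing here is dereferenced):

  /* a __seg_gs pointer lives in a distinct address space */
  int __seg_gs *pcpu_style_ptr;
  int *plain_ptr;

  int main(void)
  {
          /* plain_ptr = pcpu_style_ptr; */
          /* ^ uncommenting this is a compile error: the compiler now
           *   catches cross-address-space assignments */
          return 0;
  }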

Link: https://lkml.kernel.org/r/20241208204708.3742696-7-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:05 -08:00
Uros Bizjak
7208ab5191 percpu: repurpose __percpu tag as a named address space qualifier
The patch introduces the __percpu_qual define and repurposes the __percpu
tag as a named address space qualifier using the new define.

Arches can now conditionally define __percpu_qual as their named address
space qualifier for percpu variables.

Link: https://lkml.kernel.org/r/20241208204708.3742696-6-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:04 -08:00
Uros Bizjak
d896334b8e percpu: use TYPEOF_UNQUAL() in *_cpu_ptr() accessors
Use the TYPEOF_UNQUAL() macro to declare the return type of *_cpu_ptr()
accessors in the generic named address space, to avoid 'access to data
from pointer to non-enclosed address space' type errors.

Link: https://lkml.kernel.org/r/20241208204708.3742696-5-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:04 -08:00
Uros Bizjak
9c4ba50565 percpu: use TYPEOF_UNQUAL() in variable declarations
Use TYPEOF_UNQUAL() to declare variables as the corresponding type
without a named address space qualifier, to avoid
"`__seg_gs' specified for auto variable `var'" errors.

Link: https://lkml.kernel.org/r/20241208204708.3742696-4-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:04 -08:00
Uros Bizjak
bf3b91c387 compiler.h: introduce TYPEOF_UNQUAL() macro
Define TYPEOF_UNQUAL() to use __typeof_unqual__() as the typeof operator
when available, to return the unqualified type of the expression.

The current version of sparse doesn't know anything about the
__typeof_unqual__() operator.  Avoid using __typeof_unqual__() when
sparse checking is active to prevent sparse errors about the unknown
keyword.
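
A standalone sketch of the idea, assuming a compiler that implements
__typeof_unqual__() (e.g. GCC 14+); the kernel macro also carries a
sparse fallback that is omitted here:

  #define TYPEOF_UNQUAL(x) __typeof_unqual__(x)

  static const volatile int qualified = 42;

  int main(void)
  {
          TYPEOF_UNQUAL(qualified) copy = qualified; /* plain int */

          copy++;         /* fine: const/volatile were stripped */
          return copy == 43 ? 0 : 1;
  }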

Link: https://lkml.kernel.org/r/20241208204708.3742696-3-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:03 -08:00
Uros Bizjak
4497834701 x86/kgdb: use IS_ERR_PCPU() macro
Enable strict percpu address space checks via x86 named address space
qualifiers.  Percpu variables are declared in the __seg_gs/__seg_fs named
AS and kept named-AS qualified until they are dereferenced via a percpu
accessor.  This approach enables various compiler checks for
cross-namespace variable assignments.

Please note that the current version of sparse doesn't know anything
about the __typeof_unqual__() operator.  Avoid using __typeof_unqual__()
when sparse checking is active to prevent sparse errors about the unknown
keyword.
The proposed patch by Dan Carpenter to implement __typeof_unqual__()
handling in sparse is located at:

https://lore.kernel.org/lkml/5b8d0dee-8fb6-45af-ba6c-7f74aff9a4b8@stanley.mountain/


This patch (of 6):

Use IS_ERR_PCPU() when checking the error pointer in the percpu address
space.  This macro adds an intermediate cast to unsigned long when
switching named address spaces.

The patch avoids future build errors due to pointer address space
mismatches when strict percpu address space checks are enabled.

Link: https://lkml.kernel.org/r/20241208204708.3742696-1-ubizjak@gmail.com
Link: https://lkml.kernel.org/r/20241208204708.3742696-2-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:03 -08:00
Guo Weikang
d60b6c803c mm/early_ioremap: add null pointer checks to prevent NULL-pointer dereference
The early_ioremap interface can fail and return NULL in certain cases.
To prevent NULL-pointer dereference crashes, fix such issues in the
acpi_extlog and copy_early_mem interfaces, improving robustness when
handling early memory.

Link: https://lkml.kernel.org/r/20241212101004.1544070-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:03 -08:00
Lorenzo Stoakes
2590d8e8f8 mm: add comments to do_mmap(), mmap_region() and vm_mmap()
The difference between do_mmap(), mmap_region() and vm_mmap() isn't
always entirely clear to users, so add comments to clarify what's going
on in each.

This is compounded by the fact that we actually allow callers external to
mm to invoke both do_mmap() and mmap_region() (!), the latter of which is
really strictly speaking an internal memory mapping implementation detail.

Link: https://lkml.kernel.org/r/20241212113152.28849-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:03 -08:00
Lorenzo Stoakes
92743a959d mm: assert mmap write lock held on do_mmap(), mmap_region()
Both of these functions can be invoked outside of mm, so it is probably a
good idea to assert that the required lock is held.

This will only have an impact if CONFIG_DEBUG_VM is set; otherwise it
amounts to no change at all.
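
A hedged sketch of the shape of the change; mmap_assert_write_locked()
is the existing helper in <linux/mmap_lock.h>, and the surrounding
function bodies are elided:

  /* at the top of both do_mmap() and mmap_region() */
  mmap_assert_write_locked(mm);   /* only bites with CONFIG_DEBUG_VM */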

Link: https://lkml.kernel.org/r/20241212114841.55185-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:02 -08:00
Lorenzo Stoakes
a61646d78b MAINTAINERS: update MEMORY MAPPING section
Update the MEMORY MAPPING section to contain VMA logic as it makes no
sense to have these two sections separate.

Additionally, add files which permit changes to the attributes and/or
ranges spanned by memory mappings, in essence anything which might alter
the output of /proc/$pid/[s]maps.

This is necessarily fuzzy, as there is not quite as good separation of
concerns as we would ideally like in the kernel.  However each of these
files interacts with the VMA and memory mapping logic in such a way as to
be inseparable from it, and it is important that they are maintained in
conjunction with it.

Link: https://lkml.kernel.org/r/20241211105315.21756-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:02 -08:00
Joshua Hahn
9ebaae011c memcg/hugetlb: remove memcg hugetlb try-commit-cancel protocol
This patch fully removes the mem_cgroup_{try, commit, cancel}_charge
functions, as well as their hugetlb variants.

Link: https://lkml.kernel.org/r/20241211203951.764733-4-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:02 -08:00
Joshua Hahn
abfe537a39 memcg/hugetlb: introduce mem_cgroup_charge_hugetlb
This patch introduces mem_cgroup_charge_hugetlb which combines the logic
of mem_cgroup_hugetlb_try_charge / mem_cgroup_hugetlb_commit_charge and
removes the need for mem_cgroup_hugetlb_cancel_charge.  It also reduces
the footprint of memcg in hugetlb code and consolidates all memcg related
error paths into one.
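
A hedged sketch of the consolidated call site; the signature is inferred
from the description above, not quoted from the patch:

  /* one charge call and one error path instead of try/commit/cancel */
  ret = mem_cgroup_charge_hugetlb(folio, gfp);
  if (ret == -ENOMEM) {
          free_huge_folio(folio); /* folio already acquired: release it */
          return ERR_PTR(-ENOMEM);
  }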

Link: https://lkml.kernel.org/r/20241211203951.764733-3-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:02 -08:00
Joshua Hahn
fbd3462b74 memcg/hugetlb: introduce memcg_accounts_hugetlb
Patch series "memcg/hugetlb: Rework memcg hugetlb charging", v3.

This series cleans up memcg's hugetlb charging logic by deprecating the
current memcg hugetlb try-charge + {commit, cancel} logic present in
alloc_hugetlb_folio.  A single function mem_cgroup_charge_hugetlb takes
its place instead.  This makes the code more maintainable by simplifying
the error path and reduces memcg's footprint in hugetlb logic.

This patch introduces a few changes in the hugetlb folio allocation
error path:
(a) Instead of having multiple return points, we consolidate them to
    two: one for reaching the memcg limit or running out of memory
    (-ENOMEM) and one for the hugetlb allocation failing / the limit
    being reached (-ENOSPC).
(b) Previously, the memcg limit was checked before the folio is acquired,
    meaning the hugeTLB folio isn't acquired if the limit is reached.
    This patch performs the charging after the folio is acquired, meaning
    that if memcg's limit is reached, the acquired folio is freed right
    away.

This patch builds on two earlier patch series: [2] which adds memcg
hugeTLB counters, and [3] which deprecates charge moving and removes the
last references to mem_cgroup_cancel_charge.  The request for this cleanup
can be found in [2].

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] https://lore.kernel.org/all/20241101204402.1885383-1-joshua.hahnjy@gmail.com/
[3] https://lore.kernel.org/linux-mm/20241025012304.2473312-1-shakeel.butt@linux.dev/


This patch (of 3):

This patch isolates the check for whether memcg accounts hugetlb.  This
condition can only be true if the memcg mount option
memory_hugetlb_accounting is on, which includes hugetlb usage in
memory.current.

Link: https://lkml.kernel.org/r/20241211203951.764733-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20241211203951.764733-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:01 -08:00
Hyeonggon Yoo
b126895586 mm/migrate: remove slab checks in isolate_movable_page()
Commit 8b8817630a ("mm/migrate: make isolate_movable_page() skip slab
pages") introduced slab checks to prevent mis-identification of slab pages
as movable kernel pages.

However, after Matthew's frozen folio series, these slab checks became
unnecessary as the migration logic fails to increase the reference count
for frozen slab folios.  Remove these redundant slab checks and associated
memory barriers.

Link: https://lkml.kernel.org/r/20241210124807.8584-1-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:01 -08:00
SeongJae Park
34e0ee4bce samples/damon/prcl: implement schemes setup
Implement the proactive cold memory regions reclaiming logic of the prcl
sample module using DAMOS.  The logic treats memory regions that were not
accessed at all for five or more seconds as cold, and reclaims them as
soon as they are found.

Link: https://lkml.kernel.org/r/20241210215030.85675-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:01 -08:00
SeongJae Park
4670707828 samples/damon: introduce a skeleton of a sample DAMON module for proactive reclamation
DAMON is not only for monitoring of access patterns, but also for
access-aware system operations.  For the system operations, DAMON
provides a feature called DAMOS (Data Access Monitoring-based Operation
Schemes).  There is no sample API usage of DAMOS, though.  Copy the
working set size estimation sample module with changed names of the
module and symbols, to use it as a skeleton for a sample module showing
the DAMOS API usage.  The following commit will make it proactively
reclaim cold memory of the given process, using DAMOS.

Link: https://lkml.kernel.org/r/20241210215030.85675-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:01 -08:00
SeongJae Park
6c49b56bbc samples/damon/wsse: implement working set size estimation and logging
Implement the DAMON-based working set size estimation logic.  The logic
iterates over the memory regions in the DAMON-generated access pattern
snapshot for every aggregation interval and gets the total sum of the
sizes of all regions having an 'nr_accesses' count of one or higher.
That is, it assumes any region having one or more accesses to be a part
of the working set.  The estimated value is reported to the user by
printing it to the kernel log.
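
The estimation itself is a straightforward sum; a userspace analogue
(the region fields mirror DAMON's start/end/nr_accesses, the values are
made up):

  #include <stdio.h>

  struct region { unsigned long start, end, nr_accesses; };

  static unsigned long wss_bytes(const struct region *r, int n)
  {
          unsigned long sz = 0;

          for (int i = 0; i < n; i++)
                  if (r[i].nr_accesses >= 1) /* accessed => working set */
                          sz += r[i].end - r[i].start;
          return sz;
  }

  int main(void)
  {
          struct region snap[] = {
                  { 0x1000, 0x5000, 3 },  /* hot: counted */
                  { 0x5000, 0x9000, 0 },  /* idle: skipped */
          };

          printf("wss: %lu bytes\n", wss_bytes(snap, 2)); /* 16384 */
          return 0;
  }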

Link: https://lkml.kernel.org/r/20241210215030.85675-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:00 -08:00
SeongJae Park
d452d76f63 samples/damon/wsse: start and stop DAMON as the user requests
Start running DAMON to monitor accesses of a process that the user
specified via 'target_pid' parameter, when 'y' is passed to 'enable'
parameter.  Stop running DAMON when 'n' is passed to 'enable' parameter. 
Estimating the working set size from DAMON's monitoring results and
reporting it to the user will be implemented by the following commit.

Link: https://lkml.kernel.org/r/20241210215030.85675-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:00 -08:00
SeongJae Park
a307d7da1a samples: add a skeleton of a sample DAMON module for working set size estimation
Patch series "mm/damon: add sample modules".

Implement a proactive cold memory regions reclaiming logic for the prcl
sample module using DAMOS.  The logic treats memory regions that were not
accessed at all for five or more seconds as cold, and reclaims them as
soon as they are found.


This patch (of 5):

Add a skeleton for a sample DAMON static module that can be used for
estimating the working set size of a given process.  Note that it is a
static module since DAMON does not export symbols to loadable modules for
now.  It exposes two module parameters, namely 'pid' and 'enable'.  'pid'
specifies the process that the module will estimate the working set size
of.  'enable' receives whether to start or stop the estimation.  Because
this is just a skeleton, the parameters do nothing yet.  The
functionalities will be implemented by following commits.

Link: https://lkml.kernel.org/r/20241210215030.85675-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20241210215030.85675-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:00 -08:00
Andrew Morton
26cc2d4cbe mm-remove-an-avoidable-load-of-page-refcount-in-page_ref_add_unless-fix
add comment from David

Link: https://lkml.kernel.org/r/f5a65bf5-5105-4376-9c1c-164a15a4ab79@redhat.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:51:00 -08:00
Mateusz Guzik
0816a5f2db mm: remove an avoidable load of page refcount in page_ref_add_unless
Explicitly pre-checking the count adds nothing, as atomic_add_unless()
starts by doing the same thing.  IOW, no functional changes.

disasm of stock filemap_get_read_batch from perf top while running
readseek2_processes -t 24:

  0.04 │ cb:   mov    0x34(%rbx),%eax           # first load
 73.11 │       test   %eax,%eax
       │     ↓ je     1bd
  0.09 │       mov    0x34(%rbx),%eax           # second load
  1.01 │ d9:   test   %eax,%eax
       │     ↓ je     1bd
  0.06 │       lea    0x1(%rax),%edx
  0.00 │       lea    0x34(%rbx),%r14
  0.00 │       lock   cmpxchg %edx,0x34(%rbx)
 14.06 │     ↑ jne    d9
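
A userspace sketch of the add-unless pattern in C11 atomics
(illustrative, not the kernel implementation): the CAS loop reads the
counter itself, so a separate pre-read (the first load above) only adds
a redundant branch.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* add `a` to *v unless *v == u; true if the add happened */
  static bool add_unless(atomic_int *v, int a, int u)
  {
          int c = atomic_load_explicit(v, memory_order_relaxed);

          do {
                  if (c == u)
                          return false;
                  /* on failure, c is refreshed and rechecked above */
          } while (!atomic_compare_exchange_weak(v, &c, c + a));
          return true;
  }

  int main(void)
  {
          atomic_int refcount = 1;

          printf("%d\n", add_unless(&refcount, 1, 0)); /* 1: took a ref */
          return 0;
  }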

Link: https://lkml.kernel.org/r/20241207082931.1707465-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:59 -08:00
Yu Zhao
88b2a8d4ad mm/mglru: rework workingset protection
With the aging feedback no longer considering the distribution of folios
in each generation, rework workingset protection to better distribute
folios across MAX_NR_GENS.  This is achieved by reusing PG_workingset and
PG_referenced/LRU_REFS_FLAGS in a slightly different way.

For folios accessed multiple times through file descriptors, make
lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in folio->flags
after PG_referenced, then PG_workingset after LRU_REFS_WIDTH.  After all
its bits are set, i.e., LRU_REFS_FLAGS|BIT(PG_workingset), a folio is
lazily promoted into the second oldest generation in the eviction path. 
And when folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
lru_gen_inc_refs() can start over.  For this case, LRU_REFS_MASK is only
valid when PG_referenced is set.

For folios accessed multiple times through page tables, folio_update_gen()
from a page table walk or lru_gen_set_refs() from a rmap walk sets
PG_referenced after the accessed bit is cleared for the first time. 
Thereafter, those two paths set PG_workingset and promote folios to the
youngest generation.  Like folio_inc_gen(), when folio_update_gen() does
that, it also clears PG_referenced.  For this case, LRU_REFS_MASK is not
used.

For both of the cases, after PG_workingset is set on a folio, it remains
until this folio is either reclaimed, or "deactivated" by
lru_gen_clear_refs().  It can be set again if lru_gen_test_recent()
returns true upon a refault.

When adding folios to the LRU lists, lru_gen_distance() distributes
them as follows:
+---------------------------------+---------------------------------+
|    Accessed thru page tables    | Accessed thru file descriptors  |
+---------------------------------+---------------------------------+
| PG_active (set while isolated)  |                                 |
+----------------+----------------+----------------+----------------+
| PG_workingset  | PG_referenced  | PG_workingset  | LRU_REFS_FLAGS |
+---------------------------------+---------------------------------+
|<--------- MIN_NR_GENS --------->|                                 |
|<-------------------------- MAX_NR_GENS -------------------------->|

After this patch, some typical client and server workloads showed
improvements under heavy memory pressure.  For example, Python TPC-C,
which was used to benchmark a different approach [1] to better detect
refault distances, showed a significant decrease in total refaults:

                            Before      After      Change
  Time (seconds)            10801       10801      0%
  Executed (transactions)   41472       43663      +5%
  workingset_nodes          109070      120244     +10%
  workingset_refault_anon   5019627     7281831    +45%
  workingset_refault_file   1294678786  554855564  -57%
  workingset_refault_total  1299698413  562137395  -57%

[1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@gmail.com/

Link: https://lkml.kernel.org/r/20241207221522.2250311-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Kairui Song <kasong@tencent.com>
Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Stevens <stevensd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:59 -08:00
Yu Zhao
1bef50327e mm/mglru: rework refault detection
With anon and file min_seq being able to move independently, rework
refault detection as well so that the comparison of refaults between anon
and file is always on an equal footing.

Specifically, make lru_gen_test_recent() return true for refaults
happening within the distance of MAX_NR_GENS.  For example, if min_seq of
a type is max_seq-MIN_NR_GENS, refaults from min_seq-1, i.e.,
max_seq-MIN_NR_GENS-1, are also considered recent, since the distance
max_seq-(max_seq-MIN_NR_GENS-1), i.e., MIN_NR_GENS+1 is less than
MAX_NR_GENS.

As an intermediate step to the final optimization, this change by itself
should not have userspace-visible effects beyond performance.

Link: https://lkml.kernel.org/r/20241207221522.2250311-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Kairui Song <kasong@tencent.com>
Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Stevens <stevensd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:59 -08:00
Yu Zhao
51793e247b mm/mglru: rework type selection
With anon and file min_seq being able to move independently, rework type
selection so that it is based on the total refaults from all tiers of each
type.  Also allow a type to be selected until that type reaches
MIN_NR_GENS, and therefore abs_diff(min_seq[0],min_seq[1]) now can be 2
(MAX_NR_GENS-MIN_NR_GENS) instead of 1.

Since some tiers of a selected type can have higher refaults than the
first tier of the other type, use a smaller gain factor (2:3 instead of
1:2), in order for those tiers in the selected type to be better
protected.

As an intermediate step to the final optimization, this change by itself
should not have userspace-visible effects beyond performance.

Link: https://lkml.kernel.org/r/20241207221522.2250311-5-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: David Stevens <stevensd@chromium.org>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:59 -08:00
Yu Zhao
258403f4bd mm/mglru: rework aging feedback
The aging feedback is based on both the number of generations and the
distribution of folios in each generation.  The number of generations is
currently the distance between max_seq and anon min_seq.  This is because
anon min_seq is not allowed to move past file min_seq.  The rationale for
that is that file is always evictable whereas anon is not.  However, for
use cases where anon is a lot cheaper than file:

1. Anon in the second oldest generation can be a better choice than
   file in the oldest generation.
2. A large amount of file in the oldest generation can skew the
   distribution, making should_run_aging() return false negative.

Allow anon and file min_seq to move independently, and use solely the
number of generations as the feedback for aging.  Specifically, when both
anon and file are evictable, anon min_seq can now be greater than file
min_seq, and therefore the number of generations becomes the distance
between max_seq and min(min_seq[0],min_seq[1]).  And should_run_aging()
returns true if and only if the number of generations is less than
MAX_NR_GENS.

As the first step to the final optimization, this change by itself
should not have userspace-visible effects beyond performance.  The next
two patches will take advantage of this change; the last patch in this
series will better distribute folios across MAX_NR_GENS.

Link: https://lkml.kernel.org/r/20241207221522.2250311-4-yuzhao@google.com
Reported-by: David Stevens <stevensd@chromium.org>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:58 -08:00
Yu Zhao
e0003912bc mm/mglru: optimize deactivation
Do not shuffle a folio in the deactivation paths if it is already in the
oldest generation.  This reduces the LRU lock contention.

Before this patch, the contention is reproducible by FIO, e.g.,

  fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
      -rwmixwrite=30  --norandommap --randrepeat=0 -ioengine=sync \
      -bs=4k -numjobs=400 -runtime=25000 --time_based \
      -group_reporting -name=mglru

  98.96%--_raw_spin_lock_irqsave
          folio_lruvec_lock_irqsave
          |
           --98.78%--folio_batch_move_lru
               |
                --98.63%--deactivate_file_folio
                          mapping_try_invalidate
                          invalidate_mapping_pages
                          invalidate_bdev
                          blkdev_common_ioctl
                          blkdev_ioctl

After this patch, deactivate_file_folio() bails out early without taking
the LRU lock.

A side effect is that a folio can be left at the head of the oldest
generation, rather than the tail.  If reclaim happens at the same time, it
cannot reclaim this folio immediately.  Since there is no known
correlation between truncation and reclaim, this side effect is considered
insignificant.

Link: https://lkml.kernel.org/r/20241207221522.2250311-3-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Bharata B Rao <bharata@amd.com>
Closes: https://lore.kernel.org/CAOUHufawNerxqLm7L9Yywp3HJFiYVrYO26ePUb1jH-qxNGWzyA@mail.gmail.com/
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: David Stevens <stevensd@chromium.org>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:58 -08:00
Yu Zhao
fe09d21235 mm/mglru: clean up workingset
Patch series "mm/mglru: performance optimizations", v3.

This series improves performance for some previously reported test cases. 
Most of the code changes gathered here have been floating on the mailing
list [1][2].  They are now properly organized and have gone through
various benchmarks on client and server devices, including Android, FIO,
memcached, multiple VMs and MongoDB.

In addition to the warning [3] fixed in v2, this version fixes another
warning [4] reported by syzbot.

[1] https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
[2] https://lore.kernel.org/CAOUHufawNerxqLm7L9Yywp3HJFiYVrYO26ePUb1jH-qxNGWzyA@mail.gmail.com/
[3] https://lore.kernel.org/67294349.050a0220.701a.0010.GAE@google.com/
[4] https://lore.kernel.org/67549eca.050a0220.2477f.001b.GAE@google.com/


This patch (of 6):

Move VM_BUG_ON_FOLIO() to cover both the default and MGLRU paths.  Also
use a pair of rcu_read_lock() and rcu_read_unlock() within each path, to
improve readability.

This change should not have any side effects.

Link: https://lkml.kernel.org/r/20241207221522.2250311-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20241207221522.2250311-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: David Stevens <stevensd@chromium.org>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:58 -08:00
Kevin Brodsky
05f536764b selftests/mm: remove X permission from sigaltstack mapping
There is no reason why the alternate signal stack should be mapped as RWX.
Map it as RW instead.
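
A minimal sketch of the fixed mapping using standard userspace APIs (the
test's actual setup code is more involved):

  #include <signal.h>
  #include <sys/mman.h>

  int main(void)
  {
          stack_t ss;

          /* RW, not RWX: the signal stack holds data, not code */
          ss.ss_sp = mmap(NULL, SIGSTKSZ, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (ss.ss_sp == MAP_FAILED)
                  return 1;
          ss.ss_size = SIGSTKSZ;
          ss.ss_flags = 0;
          return sigaltstack(&ss, NULL); /* 0 on success */
  }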

Link: https://lkml.kernel.org/r/20241209095019.1732120-15-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:58 -08:00
Kevin Brodsky
9f99e48a07 selftests/mm: skip pkey_sighandler_tests if support is missing
The pkey_sighandler_tests are bound to fail if either the kernel or CPU
doesn't support pkeys.  Skip the tests if pkeys support is missing.

Link: https://lkml.kernel.org/r/20241209095019.1732120-14-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:57 -08:00
Kevin Brodsky
88b584dfd5 selftests/mm: rename pkey register macro
PKEY_ALLOW_ALL is meant to represent the pkey register value that allows
all accesses (enables all pkeys).  However, its current naming suggests
that the value applies to *one* key only (like PKEY_DISABLE_ACCESS, for
instance).

Rename PKEY_ALLOW_ALL to PKEY_REG_ALLOW_ALL to avoid such
misunderstanding.  This is consistent with the PKEY_REG_ALLOW_NONE macro
introduced by commit 6e182dc9f2 ("selftests/mm: Use generic pkey
register manipulation").

Link: https://lkml.kernel.org/r/20241209095019.1732120-13-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:57 -08:00
Kevin Brodsky
d606b6894d selftests/mm: fix dependency on pkey_util.c
The pkey* files can only be built on architectures that support pkeys
(pkey-helpers.h #error's otherwise).  Adding pkey_util.c as a dependency
of all $(TEST_GEN_FILES) is therefore a bad idea.  Make it a dependency
of the pkeys tests only.

Those tests are built in 32/64-bit variants on x86_64 so we need to add an
explicit dependency there as well.

Link: https://lkml.kernel.org/r/20241216092849.2140850-1-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:57 -08:00
Kevin Brodsky
8d795f0e9b selftests/mm: use sys_pkey helpers consistently
sys_pkey_alloc, sys_pkey_free and sys_mprotect_pkey are currently used in
protection_keys.c, while pkey_sighandler_tests.c calls the libc wrappers
directly (e.g.  pkey_mprotect()).  This is probably ok when using glibc
(those symbols appeared a while ago), but Musl does not currently provide
them.  The logging in the helpers from pkey-helpers.h can also come in
handy.

Make things more consistent by using the sys_pkey helpers in
pkey_sighandler_tests.c too.  To that end their implementation is moved to
a common .c file (pkey_util.c).  This also enables calling
is_pkeys_supported() outside of protection_keys.c, since it relies on
sys_pkey_{alloc,free}.

Link: https://lkml.kernel.org/r/20241209095019.1732120-12-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:57 -08:00
Kevin Brodsky
ef3bdbdaea selftests/mm: ensure non-global pkey symbols are marked static
The pkey tests define a whole lot of functions and some global variables. 
A few are truly global (declared in pkey-helpers.h), but the majority are
file-scoped.  Make sure those are labelled static.

Some of the pkey_{access,write}_{allow,deny} helpers are not called, or
only called when building for some architectures.  Mark them
__maybe_unused to suppress compiler warnings.

Link: https://lkml.kernel.org/r/20241209095019.1732120-11-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:56 -08:00
Kevin Brodsky
adae653ec8 selftests/mm: remove empty pkey helper definition
Some of the functions declared in pkey-helpers.h are actually defined in
protection_keys.c, meaning they can only be called from
protection_keys.c.  This is less than ideal, but it is hard to avoid as
these helpers are themselves called from inline functions in
pkey-<arch>.h.  Let's at least add a comment clarifying that.  We can also
remove the empty definition in pkey_sighandler_tests.c:
expected_pkey_fault() is not meant to be called from there.

Link: https://lkml.kernel.org/r/20241209095019.1732120-10-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:56 -08:00
Kevin Brodsky
cc5d6dfa88 selftests/mm: ensure pkey-*.h define inline functions only
Headers should not define non-inline functions, as this prevents them
from being included in more than one compilation unit of a given
program.  pkey-helpers.h and the
arch-specific headers it includes currently define multiple such
non-inline functions.

In most cases those functions can simply be made inline - this patch does
just that.  read_ptr() is an exception as it must not be inlined.  Since
it is only called from protection_keys.c, we just move it there.
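
For reference, the failure mode being avoided (an illustrative header
sketch, not the selftest code): a non-inline definition in a header is
emitted by every .c file that includes it, and the final link then fails
with a duplicate-symbol error.

  /* pkey-sketch.h -- included by two .c files of the same program */
  static inline int shift_pkey(int pkey) /* ok: no clash across TUs */
  {
          return pkey * 2;
  }

  /*
   * int shift_pkey(int pkey) { return pkey * 2; }
   * ^ without `static inline`, the linker would report
   *   "multiple definition of `shift_pkey'"
   */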

Link: https://lkml.kernel.org/r/20241209095019.1732120-9-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:56 -08:00
Kevin Brodsky
74a2592dd3 selftests/mm: define types using typedef in pkey-helpers.h
Using #define to define types should be avoided.  Use typedef instead. 
Also ensure that __u* types are actually defined by including
<linux/types.h>.
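
The classic pitfall that motivates this (hypothetical names): with a
macro, the `*` binds only to the first declarator.

  #define bad_ptr_t char *
  typedef char *good_ptr_t;

  bad_ptr_t  a, b;  /* expands to: char *a, b;  -- b is a plain char */
  good_ptr_t c, d;  /* both c and d are char * */

  int main(void)
  {
          _Static_assert(sizeof(b) == sizeof(char), "b is not a pointer");
          _Static_assert(sizeof(d) == sizeof(char *), "d is a pointer");
          return 0;
  }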

Link: https://lkml.kernel.org/r/20241209095019.1732120-8-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:56 -08:00