linux/fs
Kairui Song fb56fdf8b9 mm/list_lru: split the lock to per-cgroup scope
Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node.  This lock contention is heavy
when multiple cgroups modify the same list_lru.

This lock can be split into per-cgroup scope to reduce contention.

To achieve this, we need a stable list_lru_one for every cgroup.  This
commit adds a lock to each list_lru_one and introduced a helper function
lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg.
Then reworked the reparenting process.

Reparenting will switch the list_lru_one instances one by one.  By locking
each instance and marking it dead using the nr_items counter, reparenting
ensures that all items in the corresponding cgroup (on-list or not,
because items have a stable cgroup, see below) will see the list_lru_one
switch synchronously.

Objcg reparent is also moved after list_lru reparent so items will have a
stable mem cgroup until all list_lru_one instances are drained.

The only caller that doesn't work the *_obj interfaces are direct calls to
list_lru_{add,del}.  But it's only used by zswap and that's also based on
objcg, so it's fine.

This also changes the bahaviour of the isolation function when LRU_RETRY
or LRU_REMOVED_RETRY is returned, because now releasing the lock could
unblock reparenting and free the list_lru_one, isolation function will
have to return withoug re-lock the lru.

prepare() {
    mkdir /tmp/test-fs
    modprobe brd rd_nr=1 rd_size=33554432
    mkfs.xfs -f /dev/ram0
    mount -t xfs /dev/ram0 /tmp/test-fs
    for i in $(seq 1 512); do
        mkdir "/tmp/test-fs/$i"
        for j in $(seq 1 10240); do
            echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
        done &
    done; wait
}

do_test() {
    read_worker() {
        sleep 1
        tar -cv "$1" &>/dev/null
    }
    read_in_all() {
        cd "/tmp/test-fs" && ls
        for i in $(seq 1 512); do
            (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
            read_worker "$i" &
        done; wait
    }
    for i in $(seq 1 512); do
        mkdir -p "/sys/fs/cgroup/benchmark/$i"
    done
    echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
    echo 512M > /sys/fs/cgroup/benchmark/memory.max
    echo 3 > /proc/sys/vm/drop_caches
    time read_in_all
}

Above script simulates compression of small files in multiple cgroups
with memory pressure. Run prepare() then do_test for 6 times:

Before:
real      0m7.762s user      0m11.340s sys       3m11.224s
real      0m8.123s user      0m11.548s sys       3m2.549s
real      0m7.736s user      0m11.515s sys       3m11.171s
real      0m8.539s user      0m11.508s sys       3m7.618s
real      0m7.928s user      0m11.349s sys       3m13.063s
real      0m8.105s user      0m11.128s sys       3m14.313s

After this commit (about ~15% faster):
real      0m6.953s user      0m11.327s sys       2m42.912s
real      0m7.453s user      0m11.343s sys       2m51.942s
real      0m6.916s user      0m11.269s sys       2m43.957s
real      0m6.894s user      0m11.528s sys       2m45.346s
real      0m6.911s user      0m11.095s sys       2m43.168s
real      0m6.773s user      0m11.518s sys       2m40.774s

Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:26 -08:00
..
9p Revert patches causing inode collision problems 2024-10-25 15:25:02 -07:00
adfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
affs affs-for-6.12-tag 2024-09-16 13:07:59 +02:00
afs vfs-6.12-rc6.fixes 2024-11-01 07:37:10 -10:00
autofs autofs: fix thinko in validate_dev_ioctl() 2024-10-28 13:16:56 +01:00
bcachefs bcachefs fixes for 6.12-rc6 2024-11-01 07:21:03 -10:00
befs befs: Convert befs_symlink_read_folio() to use folio_end_read() 2024-05-31 12:31:39 +02:00
bfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
btrfs for-6.12-rc5-tag 2024-11-01 07:31:47 -10:00
cachefiles cachefiles: fix dentry leak in cachefiles_open_file() 2024-09-27 18:29:19 +02:00
ceph A fix from Patrick for a variety of CephFS lockup scenarios caused by 2024-10-04 10:10:23 -07:00
coda coda: use param->file for FSCONFIG_SET_FD 2024-08-19 13:45:03 +02:00
configfs fs/configfs: Add a callback to determine attribute visibility 2024-06-17 20:42:57 +02:00
cramfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
crypto move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
debugfs [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
devpts
dlm [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
ecryptfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
efivarfs [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
efs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
erofs erofs: use get_tree_bdev_flags() to avoid misleading messages 2024-10-21 14:30:27 +02:00
exfat move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
exportfs fhandle: relax open_by_handle_at() permission checks 2024-05-28 15:57:23 +02:00
ext2 vfs-6.12.file 2024-09-16 09:14:02 +02:00
ext4 ext4: fix off by one issue in alloc_flex_gd() 2024-10-04 17:36:28 -04:00
f2fs f2fs: allow parallel DIO reads 2024-10-11 15:12:07 +00:00
fat fat: fix uninitialized variable 2024-10-17 00:28:06 -07:00
freevxfs freevxfs: Convert freevxfs to the new mount API. 2024-03-26 09:04:53 +01:00
fuse fuse: remove stray debug line 2024-10-25 17:05:49 +02:00
gfs2 gfs2 changes 2024-09-23 11:55:17 -07:00
hfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
hfsplus move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
hostfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
hpfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
hugetlbfs mm: consolidate common checks in hugetlb_get_unmapped_area 2024-11-06 20:11:10 -08:00
iomap vfs-6.12-rc6.iomap 2024-11-01 07:45:00 -10:00
isofs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
jbd2 jbd2: remove unneeded check of ret in jbd2_fc_get_buf 2024-08-26 23:49:15 -04:00
jffs2 jffs2: Use a folio in jffs2_garbage_collect_dnode() 2024-08-19 13:40:00 +02:00
jfs jfs: Fix sanity check in dbMount 2024-10-22 09:40:37 -05:00
kernfs kernfs: mount: Remove unnecessary ‘NULL’ values from knparent 2024-05-04 19:02:39 +02:00
lockd move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
minix buffer: Convert __block_write_begin() to take a folio 2024-08-07 11:33:36 +02:00
netfs netfs: Downgrade i_rwsem for a buffered write 2024-10-17 15:33:42 +02:00
nfs NFS: remove revoked delegation from server's delegation list 2024-10-09 15:39:22 -04:00
nfs_common nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() 2024-10-03 16:19:43 -04:00
nfsd nfsd-6.12 fixes: 2024-11-02 09:27:11 -10:00
nilfs2 nilfs2: fix potential deadlock with newly created symlinks 2024-10-30 20:14:12 -07:00
nls move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
notify inotify: Fix possible deadlock in fsnotify_destroy_mark 2024-10-02 15:14:29 +02:00
ntfs3 Changes for 6.12-rc3 2024-10-08 10:53:06 -07:00
ocfs2 ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() 2024-11-07 14:14:59 -08:00
omfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
openpromfs openpromfs: add missing MODULE_DESCRIPTION() macro 2024-06-20 09:46:01 +02:00
orangefs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
overlayfs fs: pass offset and result to backing_file end_write() callback 2024-10-16 13:17:45 +02:00
proc mm: zswap: modify zswap_stored_pages to be atomic_long_t 2024-11-11 00:26:42 -08:00
pstore drm next for 6.12-rc1 2024-09-19 10:18:15 +02:00
qnx4 qnx4: add MODULE_DESCRIPTION() 2024-05-28 11:52:53 +02:00
qnx6 qnx6: Convert directory handling to use kmap_local 2024-08-07 11:31:56 +02:00
quota \n 2024-09-23 10:49:28 -07:00
ramfs mm: switch mm->get_unmapped_area() to a flag 2024-04-25 20:56:25 -07:00
reiserfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
romfs romfs: fix romfs_read_folio() 2024-08-21 22:32:58 +02:00
smb cifs: fix warning when destroy 'cifs_io_request_pool' 2024-10-23 07:42:44 -05:00
squashfs Squashfs: fix variable overflow in squashfs_readpage_block 2024-10-30 20:14:12 -07:00
sysfs Merge 6.9-rc5 into driver-core-next 2024-04-23 13:27:43 +02:00
sysv buffer: Convert __block_write_begin() to take a folio 2024-08-07 11:33:36 +02:00
tests execve: Move KUnit tests to tests/ subdirectory 2024-07-22 18:25:47 -07:00
tracefs eventfs: Use list_del_rcu() for SRCU protected list variable 2024-09-05 10:18:48 -04:00
ubifs [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
udf udf: fix uninit-value use in udf_get_fileshortad 2024-10-02 14:32:37 +02:00
ufs ufs_rename(): fix bogus argument of folio_release_kmap() 2024-10-02 00:05:09 -04:00
unicode unicode: Don't special case ignorable code points 2024-10-09 13:34:01 -04:00
vboxsf fs: Convert aops->write_end to take a folio 2024-08-07 11:32:02 +02:00
verity fsverity: expose verified fsverity built-in signatures to LSMs 2024-08-20 14:03:18 -04:00
xfs mm/list_lru: split the lock to per-cgroup scope 2024-11-11 17:22:26 -08:00
zonefs zonefs fixes for 6.12-rc2 2024-10-02 12:02:15 -07:00
aio.c fs/aio: Fix __percpu annotation of *cpu pointer in struct kioctx 2024-08-19 13:45:03 +02:00
anon_inodes.c fs: Create anon_inode_getfile_fmode() 2024-04-26 10:33:05 +02:00
attr.c nfsd-6.11 fixes: 2024-08-29 06:20:44 +12:00
backing-file.c fs: pass offset and result to backing_file end_write() callback 2024-10-16 13:17:45 +02:00
bad_inode.c
binfmt_elf_fdpic.c binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is defined 2024-08-26 13:00:38 -07:00
binfmt_elf.c Revert "binfmt_elf, coredump: Log the reason of the failed core dumps" 2024-09-26 11:39:02 -07:00
binfmt_flat.c move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
binfmt_misc.c vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
binfmt_script.c fs: binfmt: add missing MODULE_DESCRIPTION() macros 2024-05-28 12:06:51 +02:00
bpf_fs_kfuncs.c bpf: Add kfunc bpf_get_dentry_xattr() to read xattr from dentry 2024-08-07 11:26:54 -07:00
buffer.c memcg-v1: no need for memcg locking for dirty tracking 2024-11-06 20:11:19 -08:00
char_dev.c
compat_binfmt_elf.c
coredump.c Revert "binfmt_elf, coredump: Log the reason of the failed core dumps" 2024-09-26 11:39:02 -07:00
d_path.c
dax.c fsdax: dax_unshare_iter needs to copy entire blocks 2024-10-07 13:51:47 +02:00
dcache.c vfs-6.12.misc 2024-09-16 08:35:09 +02:00
direct-io.c fs/direct-io: Remove linux/prefetch.h include 2024-08-19 13:45:02 +02:00
drop_caches.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
eventfd.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
eventpoll.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
exec.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
fcntl.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
fhandle.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
file_table.c slab updates for 6.12 2024-09-18 08:53:53 +02:00
file.c close_range(): fix the logics in descriptor table trimming 2024-09-29 21:52:29 -04:00
filesystems.c
fs_context.c
fs_parser.c fs_parse: add uid & gid option option parsing helpers 2024-07-02 06:20:49 +02:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c inode: port __I_SYNC to var event 2024-08-30 08:22:39 +02:00
fsopen.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
init.c
inode.c mm/list_lru: split the lock to per-cgroup scope 2024-11-11 17:22:26 -08:00
internal.h file: reclaim 24 bytes from f_owner 2024-08-28 13:05:39 +02:00
ioctl.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
Kconfig nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT 2024-10-03 16:19:51 -04:00
Kconfig.binfmt exec: Add KUnit test for bprm_stack_limits() 2024-06-19 13:13:55 -07:00
kernel_read_file.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
libfs.c vfs-6.12.folio 2024-09-16 08:54:30 +02:00
locks.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
Makefile bpf: introduce new VFS based BPF kfuncs 2024-08-06 09:01:41 -07:00
mbcache.c
mnt_idmapping.c fuse update for 6.12 2024-09-24 15:29:42 -07:00
mount.h vfs-6.12.mount 2024-09-16 11:15:26 +02:00
mpage.c buffer: Remove calls to set and clear the folio error flag 2024-05-31 12:31:43 +02:00
namei.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
namespace.c fs: don't try and remove empty rbtree node 2024-10-17 15:33:43 +02:00
nsfs.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
open.c openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) 2024-10-10 12:09:03 +02:00
pidfs.c pidfs: check for valid pid namespace 2024-09-27 18:29:19 +02:00
pipe.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
pnode.c
pnode.h
posix_acl.c fs: Use in_group_or_capable() helper to simplify the code 2024-08-30 08:22:37 +02:00
proc_namespace.c fs: rename show_mnt_opts -> show_vfsmnt_opts 2024-06-28 14:36:43 +02:00
read_write.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
readdir.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
remap_range.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
select.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
seq_file.c seq_file: Simplify __seq_puts() 2024-05-02 16:28:20 +02:00
signalfd.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
splice.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
stack.c
stat.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
statfs.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
super.c fs/super.c: introduce get_tree_bdev_flags() 2024-10-21 14:30:26 +02:00
sync.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
sysctls.c
timerfd.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
userfaultfd.c fork: do not invoke uffd on fork if error occurs 2024-10-28 21:40:38 -07:00
utimes.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
xattr.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00