2019-05-19 13:08:55 +01:00
// SPDX-License-Identifier: GPL-2.0-only
2005-04-16 15:20:36 -07:00
/*
 * linux/mm/filemap.c
 *
 * Copyright (C) 1994-1999 Linus Torvalds
 */

/*
 * This file handles the generic file mmap semantics used by
 * most "normal" filesystems (but you don't /have/ to use this:
 * the NFS filesystem used to do this differently, for example)
 */
2011-10-16 02:01:52 -04:00
#include <linux/export.h>
2005-04-16 15:20:36 -07:00
#include <linux/compiler.h>
2016-01-22 15:10:40 -08:00
#include <linux/dax.h>
2005-04-16 15:20:36 -07:00
#include <linux/fs.h>
2017-02-08 18:51:30 +01:00
#include <linux/sched/signal.h>
2006-06-23 02:04:16 -07:00
#include <linux/uaccess.h>
2006-01-11 12:17:46 -08:00
#include <linux/capability.h>
2005-04-16 15:20:36 -07:00
#include <linux/kernel_stat.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/gfp.h>
2005-04-16 15:20:36 -07:00
#include <linux/mm.h>
#include <linux/swap.h>
2022-01-21 22:10:46 -08:00
#include <linux/swapops.h>
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for the x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
    #include <sys/mman.h>

    struct cachestat_range {
        __u64 off;
        __u64 len;
    };

    struct cachestat {
        __u64 nr_cache;
        __u64 nr_dirty;
        __u64 nr_writeback;
        __u64 nr_evicted;
        __u64 nr_recently_evicted;
    };

    int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
        struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. Users should pass 0 (i.e. no flags specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_range points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
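
As a rough illustration of the interface described above, the following
userspace sketch invokes the syscall through syscall(2). The struct names
cs_range and cs_result, the helper query_cachestat(), and the fallback
syscall number 451 are assumptions made for this example (451 is the
x86-64 number; other architectures may differ). Only the field layout is
taken from the SYNOPSIS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumption: x86-64 number, used only if headers lack it */
#endif

/* Local mirrors of the structs in the SYNOPSIS (the names are hypothetical). */
struct cs_range {
	uint64_t off;
	uint64_t len;	/* 0 means "from off to the end of the file" */
};

struct cs_result {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Query page cache statistics for the byte range described by off and len.
 * Returns 0 on success or a negative errno value on failure.
 */
static int query_cachestat(int fd, uint64_t off, uint64_t len,
			   struct cs_result *out)
{
	struct cs_range range = { .off = off, .len = len };

	assert(out != NULL);
	/* The flags argument is unused for now and must be 0. */
	if (syscall(__NR_cachestat, fd, &range, out, 0) == -1)
		return -errno;
	return 0;
}
```

On a kernel that does not implement the syscall, the helper is expected to
return -ENOSYS, so callers can fall back to mincore() in that case.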
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 18:36:07 -07:00
#include <linux/syscalls.h>
2005-04-16 15:20:36 -07:00
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/file.h>
#include <linux/uio.h>
2019-05-13 17:21:04 -07:00
#include <linux/error-injection.h>
2005-04-16 15:20:36 -07:00
#include <linux/hash.h>
#include <linux/writeback.h>
2007-10-18 14:47:32 -07:00
#include <linux/backing-dev.h>
2005-04-16 15:20:36 -07:00
#include <linux/pagevec.h>
#include <linux/security.h>
2006-03-24 03:16:04 -08:00
#include <linux/cpuset.h>
mm: memcontrol: rewrite charge API
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.
Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:
mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.
mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.
mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.
As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().
[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 14:19:20 -07:00
#include <linux/hugetlb.h>
2008-02-07 00:13:53 -08:00
#include <linux/memcontrol.h>
2017-11-15 17:37:41 -08:00
#include <linux/shmem_fs.h>
2014-04-07 15:37:19 -07:00
#include <linux/rmap.h>
2018-10-26 15:06:08 -07:00
#include <linux/delayacct.h>
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risking complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
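
The per-CPU weighting described above can be sketched in plain C. This is
a simplified userspace model under stated assumptions, not the kernel's
code: struct cpu_sample and weighted_stall_pct() are hypothetical names,
and all times are in the same arbitrary unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One CPU's sample for an aggregation period (hypothetical layout). */
struct cpu_sample {
	uint64_t stall_time;	/* time spent in the stall state */
	uint64_t nonidle_time;	/* time the CPU was not idle */
};

/*
 * Average the stall time across CPUs, weighting each CPU by its non-idle
 * time so that unused CPUs do not dilute the result, then express the
 * average as a percentage of the period.
 */
static double weighted_stall_pct(const struct cpu_sample *s, size_t n,
				 uint64_t period)
{
	double weighted = 0.0, weight = 0.0;

	assert(period > 0);
	for (size_t i = 0; i < n; i++) {
		weighted += (double)s[i].stall_time * (double)s[i].nonidle_time;
		weight += (double)s[i].nonidle_time;
	}
	if (weight == 0.0)
		return 0.0;	/* every CPU was idle: nothing stalled */
	return 100.0 * (weighted / weight) / (double)period;
}
```

With one CPU stalled for half of a fully busy period and a second CPU
idle, this yields 50%, whereas an unweighted mean would report 25%.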
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 15:06:27 -07:00
#include <linux/psi.h>
2019-10-18 20:20:20 -07:00
#include <linux/ramfs.h>
2020-08-06 23:19:55 -07:00
#include <linux/page_idle.h>
2022-01-21 22:10:46 -08:00
#include <linux/migrate.h>
2023-02-14 15:01:42 +00:00
#include <linux/pipe_fs_i.h>
#include <linux/splice.h>
2023-12-15 15:51:54 -05:00
#include <linux/rcupdate_wait.h>
mm: allow read-ahead with IOCB_NOWAIT set
Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9
("mm: allow read-ahead with IOCB_NOWAIT set"). However, this
implementation broke the semantics of IOCB_NOWAIT by potentially causing
it to wait on I/O during memory reclamation. This behavior was later
modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO").
To resolve the blocking issue during memory reclamation, we can use
memalloc_noio_{save,restore} to ensure non-blocking behavior. This change
restores the original functionality, allowing preadv2(IOCB_NOWAIT) to
trigger readahead if the file content is not present in the page cache.
While this process may trigger direct memory reclamation, the
__GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't
block.
A use case for this change is when we want to trigger readahead in the
preadv2(2) syscall if the file cache is absent, but without waiting for
certain filesystem locks, like xfs_ilock. A simple example is as follows:
retry:
	if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
		do_other_work();
		goto retry;
	}
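
A slightly fuller userspace sketch of the same pattern follows. The helper
name read_nowait() is hypothetical, and the fallback definition of
RWF_NOWAIT is only there for older C library headers. It is an
illustration under those assumptions, not code from this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008	/* value from the Linux UAPI headers */
#endif

/*
 * Try to read len bytes at offset without blocking.
 * Returns the byte count, or a negative errno value. -EAGAIN means the
 * data was not immediately available and the caller may retry later.
 */
static ssize_t read_nowait(int fd, void *buf, size_t len, off_t offset)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);
	ret = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
	if (ret == -1)
		return -errno;
	return ret;
}
```

Unlike the loop above, which retries on any failure, a caller of this
helper would normally retry only on -EAGAIN and treat other errors as
final.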
Link: https://lore.gnuweeb.org/io-uring/20200624164127.GP21350@casper.infradead.org/
Link: https://lkml.kernel.org/r/20240820022639.89562-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-20 10:26:39 +08:00
#include <linux/sched/mm.h>
2020-12-19 15:19:23 +03:00
#include <asm/pgalloc.h>
2021-02-10 11:15:11 +00:00
#include <asm/tlbflush.h>
2006-03-22 00:08:33 -08:00
#include "internal.h"

2013-04-29 15:06:10 -07:00
#define CREATE_TRACE_POINTS
#include <trace/events/filemap.h>

2005-04-16 15:20:36 -07:00
/*
 * FIXME: remove all knowledge of the buffer layer from the core VM
 */
2009-08-17 19:52:36 +02:00
#include <linux/buffer_head.h> /* for try_to_free_buffers */
2005-04-16 15:20:36 -07:00

#include <asm/mman.h>

2023-05-02 18:36:07 -07:00
#include "swap.h"

2005-04-16 15:20:36 -07:00
/*
 * Shared mappings implemented 30.11.1994. It's not fully working yet,
 * though.
 *
 * Shared mappings now work. 15.8.1995 Bruno.
 *
 * finished 'unifying' the page and buffer cache and SMP-threaded the
 * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
 *
 * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
 */

/*
 * Lock ordering:
 *
2014-12-12 16:54:24 -08:00
 *  ->i_mmap_rwsem		(truncate_pagecache)
2022-02-09 20:22:12 +00:00
 *    ->private_lock		(__free_pte->block_dirty_folio)
[PATCH] swap: swap_lock replace list+device
The idea of a swap_device_lock per device, and a swap_list_lock over them all,
is appealing; but in practice almost every holder of swap_device_lock must
already hold swap_list_lock, which defeats the purpose of the split.
The only exceptions have been swap_duplicate, valid_swaphandles and an
untrodden path in try_to_unuse (plus a few places added in this series).
valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
demand attention. However, with the hold time in get_swap_pages so much
reduced, I've not yet found a load and set of swap device priorities to show
even swap_duplicate benefitting from the split. Certainly the split is mere
overhead in the common case of a single swap device.
So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
(generally we seem to prefer an _ in the name, and not hide in a macro).
If someone can show a regression in swap_duplicate, then probably we should
add a hashlock for the swap_map entries alone (shorts being anatomic), so as
to help the case of the single swap device too.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-03 15:54:41 -07:00
 *      ->swap_lock		(exclusive_swap_page, others)
2018-04-10 16:36:56 -07:00
 *        ->i_pages lock
2005-04-16 15:20:36 -07:00
 *
2021-04-12 15:50:21 +02:00
 *  ->i_rwsem
2021-01-28 19:19:45 +01:00
 *    ->invalidate_lock		(acquired by fs in truncate path)
 *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
2005-04-16 15:20:36 -07:00
 *
2020-06-08 21:33:54 -07:00
 *  ->mmap_lock
2014-12-12 16:54:24 -08:00
 *    ->i_mmap_rwsem
2005-10-29 18:16:41 -07:00
 *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
2018-04-10 16:36:56 -07:00
 *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
2005-04-16 15:20:36 -07:00
 *
2020-06-08 21:33:54 -07:00
 *  ->mmap_lock
2021-01-28 19:19:45 +01:00
 *    ->invalidate_lock		(filemap_fault)
 *      ->lock_page		(filemap_fault, access_process_vm)
2005-04-16 15:20:36 -07:00
 *
2021-04-12 15:50:21 +02:00
 *  ->i_rwsem			(generic_perform_write)
2021-08-02 13:44:20 +02:00
 *    ->mmap_lock		(fault_in_readable->do_page_fault)
2005-04-16 15:20:36 -07:00
 *
2011-04-21 18:19:44 -06:00
 *  bdi->wb.list_lock
2011-03-22 22:23:41 +11:00
 *    sb_lock			(fs/fs-writeback.c)
2018-04-10 16:36:56 -07:00
 *    ->i_pages lock		(__sync_single_inode)
2005-04-16 15:20:36 -07:00
 *
2014-12-12 16:54:24 -08:00
 *  ->i_mmap_rwsem
2023-01-20 11:26:49 -05:00
 *    ->anon_vma.lock		(vma_merge)
2005-04-16 15:20:36 -07:00
 *
 *  ->anon_vma.lock
2005-10-29 18:16:41 -07:00
 *    ->page_table_lock or pte_lock	(anon_vma_prepare and various)
2005-04-16 15:20:36 -07:00
 *
2005-10-29 18:16:41 -07:00
 *  ->page_table_lock or pte_lock
2005-09-03 15:54:41 -07:00
 *    ->swap_lock		(try_to_unmap_one)
2005-04-16 15:20:36 -07:00
 *    ->private_lock		(try_to_unmap_one)
2018-04-10 16:36:56 -07:00
 *    ->i_pages lock		(try_to_unmap_one)
2024-08-02 17:55:23 +02:00
 *    ->lruvec->lru_lock	(follow_page_mask->mark_page_accessed)
2024-08-26 14:58:13 +08:00
 *    ->lruvec->lru_lock	(check_pte_range->folio_isolate_lru)
2023-12-20 23:44:56 +01:00
 *    ->private_lock		(folio_remove_rmap_pte->set_page_dirty)
 *    ->i_pages lock		(folio_remove_rmap_pte->set_page_dirty)
 *    bdi.wb->list_lock		(folio_remove_rmap_pte->set_page_dirty)
 *    ->inode->i_lock		(folio_remove_rmap_pte->set_page_dirty)
 *    ->memcg->move_lock	(folio_remove_rmap_pte->folio_memcg_lock)
2011-04-21 18:19:44 -06:00
 *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
2011-03-22 22:23:36 +11:00
 *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
2022-02-09 20:22:12 +00:00
 *    ->private_lock		(zap_pte_range->block_dirty_folio)
2005-04-16 15:20:36 -07:00
 */

2024-02-19 07:27:09 +01:00
static void mapping_set_update(struct xa_state *xas,
		struct address_space *mapping)
{
	if (dax_mapping(mapping) || shmem_mapping(mapping))
		return;
	xas_set_update(xas, workingset_update_node);
	xas_set_lru(xas, &shadow_nodes);
}

2017-11-21 09:17:59 -05:00
static void page_cache_delete(struct address_space *mapping,
2021-05-08 00:35:49 -04:00
				   struct folio *folio, void *shadow)
2014-04-03 14:47:49 -07:00
{
2021-05-08 00:35:49 -04:00
	XA_STATE(xas, &mapping->i_pages, folio->index);
	long nr = 1;
2016-12-12 16:43:17 -08:00

2017-11-21 09:17:59 -05:00
	mapping_set_update(&xas, mapping);
2016-12-12 16:43:17 -08:00

2023-09-26 12:20:17 -07:00
	xas_set_order(&xas, folio->index, folio_order(folio));
	nr = folio_nr_pages(folio);
2014-04-03 14:47:49 -07:00

2021-05-08 00:35:49 -04:00
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
mm: keep page cache radix tree nodes in check
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers. But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed. This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting. The shadow entries will just
sit there and waste memory. In the worst case, the shadow entries will
accumulate until the machine runs out of memory.
To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads. A simple shrinker will then
reclaim these nodes on memory pressure.
A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:
1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.
4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
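
Item 2 can be illustrated with a small bit-packing sketch. The shift, the
mask, and the helper names below are assumptions chosen for this example
rather than the kernel's definitions. The only fact taken from the text is
that a node holds at most 64 entries, so the regular count fits in the low
7 bits of an unsigned int and the upper bits are free for a shadow count.

```c
#include <assert.h>

#define NODE_MAX_ENTRIES	64U	/* from the text above */
#define COUNT_BITS		7U	/* enough to represent 0..64 */
#define COUNT_MASK		((1U << COUNT_BITS) - 1U)

/* Regular entries live in the low bits of the combined counter. */
static unsigned int node_entries(unsigned int count)
{
	return count & COUNT_MASK;
}

/* Shadow entries live in the otherwise unused upper bits. */
static unsigned int node_shadows(unsigned int count)
{
	return count >> COUNT_BITS;
}

static unsigned int node_shadow_inc(unsigned int count)
{
	assert(node_shadows(count) < NODE_MAX_ENTRIES);
	return count + (1U << COUNT_BITS);
}

static unsigned int node_shadow_dec(unsigned int count)
{
	assert(node_shadows(count) > 0);	/* guard against underflow */
	return count - (1U << COUNT_BITS);
}

/* A node qualifies for the shadow node LRU once it holds only shadow entries. */
static int node_only_shadows(unsigned int count)
{
	return node_entries(count) == 0 && node_shadows(count) > 0;
}
```

Packing both counters into one word keeps the check for "only shadow
entries left" to a single load, at no extra space cost in the node.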
[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 14:47:56 -07:00
|
|
|
|
2017-11-21 09:17:59 -05:00
|
|
|
xas_store(&xas, shadow);
|
|
|
|
xas_init_marks(&xas);
|
mm: filemap: don't plant shadow entries without radix tree node
When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:
kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100
This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.
Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and bump its
node->count, i.e. increase the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.
Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.
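The mis-accounting and the fix can be modelled in a few lines. This is a simplified sketch under stated assumptions: node_model, generic_extend(), shadow_dec_checked() and shadow_to_store() are hypothetical names for this example, and the 7-bit split of the counter is assumed rather than taken from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical split: low 7 bits count pages, upper bits count shadows. */
#define COUNT_SHIFT 7u
#define COUNT_MASK  ((1u << COUNT_SHIFT) - 1u)

struct node_model {
	unsigned int count;
};

/*
 * Models the generic tree code growing the tree: the entry that sat
 * directly in the root is moved into a new node and counted as an
 * ordinary entry, even if it was a shadow.
 */
static void generic_extend(struct node_model *node)
{
	node->count++;
}

/* Shadow removal with an underflow check; false means "would underflow". */
static bool shadow_dec_checked(struct node_model *node)
{
	if ((node->count >> COUNT_SHIFT) == 0)
		return false;
	node->count -= 1u << COUNT_SHIFT;
	return true;
}

/* The fix in spirit: only plant a shadow when a node can account for it. */
static void *shadow_to_store(const struct node_model *node, void *shadow)
{
	return node ? shadow : NULL;
}
```

In this model the shadow that was moved by generic_extend() sits in the page bits, so the later shadow decrement has nothing to subtract from. Refusing to store a shadow without a node avoids the mismatch at the cost described above.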
Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-04 22:02:08 +02:00
|
|
|
|
2021-05-08 00:35:49 -04:00
|
|
|
folio->mapping = NULL;
|
2017-11-15 17:37:26 -08:00
|
|
|
/* Leave page->index set: truncation lookup relies upon it */
|
2016-10-04 22:02:08 +02:00
|
|
|
mapping->nrpages -= nr;
|
2014-04-03 14:47:49 -07:00
|
|
|
}
|
|
|
|
|
2021-05-08 20:04:05 -04:00
|
|
|
static void filemap_unaccount_folio(struct address_space *mapping,
|
|
|
|
struct folio *folio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2021-05-08 20:04:05 -04:00
|
|
|
long nr;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2021-05-08 20:04:05 -04:00
|
|
|
VM_BUG_ON_FOLIO(folio_mapped(folio), folio);
|
|
|
|
if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(folio_mapped(folio))) {
|
mm: __delete_from_page_cache show Bad page if mapped
Commit e1534ae95004 ("mm: differentiate page_mapped() from
page_mapcount() for compound pages") changed the famous
BUG_ON(page_mapped(page)) in __delete_from_page_cache() to
VM_BUG_ON_PAGE(page_mapped(page)): which gives us more info when
CONFIG_DEBUG_VM=y, but nothing at all when not.
Although it has not usually been very helpful, being hit long after the
error in question, we do need to know if it actually happens on users'
systems; but reinstating a crash there is likely to be opposed :)
In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
but that seems to be the standard procedure now. Move that, or the
VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
unNULLified page->mapping gives a little more information.
If the inode is being evicted (rather than truncated), it won't have any
vmas left, so it's safe(ish) to assume that the raised mapcount is
erroneous, and we can discount it from page_count to avoid leaking the
page (I'm less worried by leaking the occasional 4kB, than losing a
potential 2MB page with each 4kB page leaked).
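The discount can be sketched as a small model. Everything here is an assumption for illustration: page_model and discount_stale_mapcount() are hypothetical names, the mapcount is stored as a plain count with 0 meaning unmapped (the kernel stores it differently), and reading the "+ 2" margin as room for the page cache's and the caller's references is a guess, not something the message states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page's counters. */
struct page_model {
	int refcount; /* all references held on the page */
	int mapcount; /* number of mappings; 0 means unmapped in this model */
};

/*
 * If the mapping is being torn down and the page still claims to be
 * mapped, treat the mapcount as stale: clear it and drop the references
 * it accounted for, so the page can still be freed later.
 * Returns true when the discount was applied.
 */
static bool discount_stale_mapcount(struct page_model *page,
				    bool mapping_exiting)
{
	int mapcount = page->mapcount;

	if (!mapping_exiting || mapcount <= 0)
		return false;
	/* Only discount when enough references remain afterwards. */
	if (page->refcount < mapcount + 2)
		return false;

	page->mapcount = 0;
	page->refcount -= mapcount;
	return true;
}
```

When the mapping is not exiting the model leaves the counters alone, which matches the message's reasoning that the assumption is only safe(ish) once no vmas are left.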
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 14:08:07 -08:00
|
|
|
pr_alert("BUG: Bad page cache in process %s pfn:%05lx\n",
|
2021-05-08 20:04:05 -04:00
|
|
|
current->comm, folio_pfn(folio));
|
|
|
|
dump_page(&folio->page, "still mapped when deleted");
|
2016-03-09 14:08:07 -08:00
|
|
|
dump_stack();
|
|
|
|
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
|
|
|
|
|
2022-03-24 18:09:52 -07:00
|
|
|
if (mapping_exiting(mapping) && !folio_test_large(folio)) {
|
2024-04-09 21:22:56 +02:00
|
|
|
int mapcount = folio_mapcount(folio);
|
2022-03-24 18:09:52 -07:00
|
|
|
|
|
|
|
if (folio_ref_count(folio) >= mapcount + 2) {
|
|
|
|
/*
|
|
|
|
* All vmas have already been torn down, so it's
|
|
|
|
* a good bet that actually the page is unmapped
|
|
|
|
* and we'd rather not leak it: if we're wrong,
|
|
|
|
* another bad page check should catch it later.
|
|
|
|
*/
|
2024-05-29 13:19:03 +02:00
|
|
|
atomic_set(&folio->_mapcount, -1);
|
2022-03-24 18:09:52 -07:00
|
|
|
folio_ref_sub(folio, mapcount);
|
|
|
|
}
|
2016-03-09 14:08:07 -08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-05-08 20:04:05 -04:00
|
|
|
/* hugetlb folios do not participate in page cache accounting. */
|
|
|
|
if (folio_test_hugetlb(folio))
|
2017-11-15 17:37:29 -08:00
|
|
|
return;
|
2017-07-10 15:47:35 -07:00
|
|
|
|
2021-05-08 20:04:05 -04:00
|
|
|
nr = folio_nr_pages(folio);
|
2017-11-15 17:37:29 -08:00
|
|
|
|
2021-05-08 20:04:05 -04:00
|
|
|
__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
|
|
|
|
if (folio_test_swapbacked(folio)) {
|
|
|
|
__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
|
|
|
|
if (folio_test_pmd_mappable(folio))
|
|
|
|
__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
|
|
|
|
} else if (folio_test_pmd_mappable(folio)) {
|
|
|
|
__lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
|
2019-09-23 15:38:03 -07:00
|
|
|
filemap_nr_thps_dec(mapping);
|
2016-07-26 15:26:18 -07:00
|
|
|
}
|
2017-11-15 17:37:29 -08:00
|
|
|
|
|
|
|
/*
|
2021-05-08 20:04:05 -04:00
|
|
|
* At this point folio must be either written or cleaned by
|
|
|
|
* truncate. Dirty folio here signals a bug and loss of
|
2022-03-24 18:13:59 -07:00
|
|
|
* unwritten data - on ordinary filesystems.
|
2017-11-15 17:37:29 -08:00
|
|
|
*
|
2022-03-24 18:13:59 -07:00
|
|
|
* But it's harmless on in-memory filesystems like tmpfs; and can
|
|
|
|
* occur when a driver which did get_user_pages() sets page dirty
|
|
|
|
* before putting it, while the inode is being finally evicted.
|
|
|
|
*
|
|
|
|
* Below fixes dirty accounting after removing the folio entirely
|
2021-05-08 20:04:05 -04:00
|
|
|
* but leaves the dirty flag set: it has no effect for truncated
|
|
|
|
* folio and anyway will be cleared before returning folio to
|
2017-11-15 17:37:29 -08:00
|
|
|
* buddy allocator.
|
|
|
|
*/
|
2022-03-24 18:13:59 -07:00
|
|
|
if (WARN_ON_ONCE(folio_test_dirty(folio) &&
|
|
|
|
mapping_can_writeback(mapping)))
|
|
|
|
folio_account_cleaned(folio, inode_to_wb(mapping->host));
|
2017-11-15 17:37:29 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Delete a page from the page cache and free it. Caller has to make
|
|
|
|
* sure the page is locked and that nobody else uses it - or that usage
|
2018-04-10 16:36:56 -07:00
|
|
|
* is safe. The caller must hold the i_pages lock.
|
2017-11-15 17:37:29 -08:00
|
|
|
*/
|
2021-05-09 09:33:42 -04:00
|
|
|
void __filemap_remove_folio(struct folio *folio, void *shadow)
|
2017-11-15 17:37:29 -08:00
|
|
|
{
|
2021-05-09 09:33:42 -04:00
|
|
|
struct address_space *mapping = folio->mapping;
|
2017-11-15 17:37:29 -08:00
|
|
|
|
2021-07-23 09:29:46 -04:00
|
|
|
trace_mm_filemap_delete_from_page_cache(folio);
|
2021-05-08 20:04:05 -04:00
|
|
|
filemap_unaccount_folio(mapping, folio);
|
2021-05-08 00:35:49 -04:00
|
|
|
page_cache_delete(mapping, folio, shadow);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2021-07-28 15:52:34 -04:00
|
|
|
void filemap_free_folio(struct address_space *mapping, struct folio *folio)
|
2017-11-15 17:37:18 -08:00
|
|
|
{
|
2022-05-01 07:35:31 -04:00
|
|
|
void (*free_folio)(struct folio *);
|
2022-01-07 13:03:48 -05:00
|
|
|
int refs = 1;
|
2017-11-15 17:37:18 -08:00
|
|
|
|
2022-05-01 07:35:31 -04:00
|
|
|
free_folio = mapping->a_ops->free_folio;
|
|
|
|
if (free_folio)
|
|
|
|
free_folio(folio);
|
2017-11-15 17:37:18 -08:00
|
|
|
|
2023-09-26 12:20:17 -07:00
|
|
|
if (folio_test_large(folio))
|
2022-01-07 13:03:48 -05:00
|
|
|
refs = folio_nr_pages(folio);
|
|
|
|
folio_put_refs(folio, refs);
|
2017-11-15 17:37:18 -08:00
|
|
|
}
|
|
|
|
|
2011-03-22 16:32:43 -07:00
|
|
|
/**
|
2021-05-09 09:33:42 -04:00
|
|
|
* filemap_remove_folio - Remove folio from page cache.
|
|
|
|
* @folio: The folio.
|
2011-03-22 16:32:43 -07:00
|
|
|
*
|
2021-05-09 09:33:42 -04:00
|
|
|
* This must be called only on folios that are locked and have been
|
|
|
|
* verified to be in the page cache. It will never put the folio into
|
|
|
|
* the free list because the caller has a reference on the page.
|
2011-03-22 16:32:43 -07:00
|
|
|
*/
|
2021-05-09 09:33:42 -04:00
|
|
|
void filemap_remove_folio(struct folio *folio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2021-05-09 09:33:42 -04:00
|
|
|
struct address_space *mapping = folio->mapping;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2021-05-09 09:33:42 -04:00
|
|
|
BUG_ON(!folio_test_locked(folio));
|
vfs: keep inodes with page cache off the inode shrinker LRU
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inode could fill lowmem zones
before the cache was getting reclaimed in the highmem zones.
To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree. It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.
Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.
The consequence of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.
To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
attached pages"")).
The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work. This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.
Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An otherwise
clean and unused inode then gets queued when the last cache entry
disappears. This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.
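A single-threaded model of this policy might look as follows. It is a sketch only: inode_model and the model_* helpers are hypothetical names, locking is left out entirely, and the real shrinkability test considers more state than a page counter and one flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the inode state that matters here. */
struct inode_model {
	unsigned long nrpages; /* cache entries still attached */
	bool open_or_dirty;    /* other reasons to stay off the LRU */
	bool on_lru;           /* queued for the inode shrinker */
};

static bool model_shrinkable(const struct inode_model *inode)
{
	return inode->nrpages == 0 && !inode->open_or_dirty;
}

/* Removing the last cache entry is what queues an otherwise idle inode. */
static void model_remove_page(struct inode_model *inode)
{
	inode->nrpages--;
	if (model_shrinkable(inode))
		inode->on_lru = true;
}

/* Additions do not touch the LRU; the shrinker re-checks instead. */
static void model_add_page(struct inode_model *inode)
{
	inode->nrpages++;
}

/* Returns true if the shrinker may reclaim the inode it just visited. */
static bool model_shrinker_may_reclaim(struct inode_model *inode)
{
	if (!inode->on_lru)
		return false;
	if (!model_shrinkable(inode)) {
		inode->on_lru = false; /* repopulated since it was queued */
		return false;
	}
	return true;
}
```

The shrinker in this model never has to walk populated inodes, because they are only queued when the last entry goes away.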
Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-08 18:31:24 -08:00
|
|
|
spin_lock(&mapping->host->i_lock);
|
2021-09-02 14:53:18 -07:00
|
|
|
xa_lock_irq(&mapping->i_pages);
|
2021-05-09 09:33:42 -04:00
|
|
|
__filemap_remove_folio(folio, NULL);
|
2021-09-02 14:53:18 -07:00
|
|
|
xa_unlock_irq(&mapping->i_pages);
|
2021-11-08 18:31:24 -08:00
|
|
|
if (mapping_shrinkable(mapping))
|
|
|
|
inode_add_lru(mapping->host);
|
|
|
|
spin_unlock(&mapping->host->i_lock);
|
2010-12-01 13:35:19 -05:00
|
|
|
|
2021-05-09 09:33:42 -04:00
|
|
|
filemap_free_folio(mapping, folio);
|
2011-03-22 16:30:53 -07:00
|
|
|
}
|
|
|
|
|
2017-11-15 17:37:33 -08:00
|
|
|
/*
|
2021-12-07 14:15:07 -05:00
|
|
|
* page_cache_delete_batch - delete several folios from page cache
|
|
|
|
* @mapping: the mapping to which folios belong
|
|
|
|
* @fbatch: batch of folios to delete
|
2017-11-15 17:37:33 -08:00
|
|
|
*
|
2021-12-07 14:15:07 -05:00
|
|
|
* The function walks over mapping->i_pages and removes folios passed in
|
|
|
|
* @fbatch from the mapping. The function expects @fbatch to be sorted
|
|
|
|
* by page index and is optimised for it to be dense.
|
|
|
|
* It tolerates holes in @fbatch (mapping entries at those indices are not
|
|
|
|
* modified).
|
2017-11-15 17:37:33 -08:00
|
|
|
*
|
2018-04-10 16:36:56 -07:00
|
|
|
* The function expects the i_pages lock to be held.
|
2017-11-15 17:37:33 -08:00
|
|
|
*/
|
2017-12-04 03:59:45 -05:00
|
|
|
static void page_cache_delete_batch(struct address_space *mapping,
|
2021-12-07 14:15:07 -05:00
|
|
|
struct folio_batch *fbatch)
|
2017-11-15 17:37:33 -08:00
|
|
|
{
|
2021-12-07 14:15:07 -05:00
|
|
|
XA_STATE(xas, &mapping->i_pages, fbatch->folios[0]->index);
|
2020-06-27 22:19:08 -04:00
|
|
|
long total_pages = 0;
|
2019-09-23 15:34:52 -07:00
|
|
|
int i = 0;
|
2021-03-12 23:13:46 -05:00
|
|
|
struct folio *folio;
|
2017-11-15 17:37:33 -08:00
|
|
|
|
2017-12-04 03:59:45 -05:00
|
|
|
mapping_set_update(&xas, mapping);
|
2021-03-12 23:13:46 -05:00
|
|
|
xas_for_each(&xas, folio, ULONG_MAX) {
|
2021-12-07 14:15:07 -05:00
|
|
|
if (i >= folio_batch_count(fbatch))
|
2017-11-15 17:37:33 -08:00
|
|
|
break;
|
2019-09-23 15:34:52 -07:00
|
|
|
|
|
|
|
/* A swap/dax/shadow entry got inserted? Skip it. */
|
2021-03-12 23:13:46 -05:00
|
|
|
if (xa_is_value(folio))
|
2017-11-15 17:37:33 -08:00
|
|
|
continue;
|
2019-09-23 15:34:52 -07:00
|
|
|
/*
|
|
|
|
* A page got inserted in our range? Skip it. We have our
|
|
|
|
* pages locked so they are protected from being removed.
|
|
|
|
* If we see a page whose index is higher than ours, it
|
|
|
|
* means our page has been removed, which shouldn't be
|
|
|
|
* possible because we're holding the PageLock.
|
|
|
|
*/
|
2021-12-07 14:15:07 -05:00
|
|
|
if (folio != fbatch->folios[i]) {
|
2021-03-12 23:13:46 -05:00
|
|
|
VM_BUG_ON_FOLIO(folio->index >
|
2021-12-07 14:15:07 -05:00
|
|
|
fbatch->folios[i]->index, folio);
|
2019-09-23 15:34:52 -07:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2021-03-12 23:13:46 -05:00
|
|
|
WARN_ON_ONCE(!folio_test_locked(folio));
|
2019-09-23 15:34:52 -07:00
|
|
|
|
2020-06-27 22:19:08 -04:00
|
|
|
folio->mapping = NULL;
|
2021-12-07 14:15:07 -05:00
|
|
|
/* Leave folio->index set: truncation lookup relies on it */
|
2019-09-23 15:34:52 -07:00
|
|
|
|
2020-06-27 22:19:08 -04:00
|
|
|
i++;
|
2017-12-04 03:59:45 -05:00
|
|
|
xas_store(&xas, NULL);
|
2020-06-27 22:19:08 -04:00
|
|
|
total_pages += folio_nr_pages(folio);
|
2017-11-15 17:37:33 -08:00
|
|
|
}
|
|
|
|
mapping->nrpages -= total_pages;
|
|
|
|
}
|
|
|
|
|
|
|
|
void delete_from_page_cache_batch(struct address_space *mapping,
|
2021-12-07 14:15:07 -05:00
|
|
|
struct folio_batch *fbatch)
|
2017-11-15 17:37:33 -08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2021-12-07 14:15:07 -05:00
|
|
|
if (!folio_batch_count(fbatch))
|
2017-11-15 17:37:33 -08:00
|
|
|
return;
|
|
|
|
|
2021-11-08 18:31:24 -08:00
|
|
|
spin_lock(&mapping->host->i_lock);
|
2021-09-02 14:53:18 -07:00
|
|
|
xa_lock_irq(&mapping->i_pages);
|
2021-12-07 14:15:07 -05:00
|
|
|
for (i = 0; i < folio_batch_count(fbatch); i++) {
|
|
|
|
struct folio *folio = fbatch->folios[i];
|
2017-11-15 17:37:33 -08:00
|
|
|
|
2021-07-23 09:29:46 -04:00
|
|
|
trace_mm_filemap_delete_from_page_cache(folio);
|
|
|
|
filemap_unaccount_folio(mapping, folio);
|
2017-11-15 17:37:33 -08:00
|
|
|
}
|
2021-12-07 14:15:07 -05:00
|
|
|
page_cache_delete_batch(mapping, fbatch);
|
2021-09-02 14:53:18 -07:00
|
|
|
xa_unlock_irq(&mapping->i_pages);
|
2021-11-08 18:31:24 -08:00
|
|
|
if (mapping_shrinkable(mapping))
|
|
|
|
inode_add_lru(mapping->host);
|
|
|
|
spin_unlock(&mapping->host->i_lock);
|
2017-11-15 17:37:33 -08:00
|
|
|
|
2021-12-07 14:15:07 -05:00
|
|
|
for (i = 0; i < folio_batch_count(fbatch); i++)
|
|
|
|
filemap_free_folio(mapping, fbatch->folios[i]);
|
2017-11-15 17:37:33 -08:00
|
|
|
}
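The queueing rule described in the commit message above (an otherwise clean and unused inode is put on the LRU when its last cache entry disappears) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct model_inode` and every `model_*` function are hypothetical names, locking is left out entirely, and the real `mapping_shrinkable()` may apply further conditions that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the inode state that matters for LRU queueing. */
struct model_inode {
	unsigned long nrpages;	/* populated page cache entries */
	unsigned int i_count;	/* active references */
	bool dirty;
	bool on_lru;
};

/* Simplified stand-in for mapping_shrinkable(): no cache entries left. */
bool model_shrinkable(const struct model_inode *inode)
{
	return inode->nrpages == 0;
}

/* Queue the inode only if it is unused, clean, and depopulated. */
void model_add_lru(struct model_inode *inode)
{
	if (inode->i_count == 0 && !inode->dirty && model_shrinkable(inode))
		inode->on_lru = true;
}

/* Deleting a batch of entries may make the inode eligible for the LRU. */
void model_delete_batch(struct model_inode *inode, unsigned long nr)
{
	assert(inode != NULL);
	assert(nr <= inode->nrpages);

	inode->nrpages -= nr;
	if (model_shrinkable(inode))
		model_add_lru(inode);
}
```

In this model, an inode with three cached entries stays off the LRU after two of them are deleted and is queued only once the third goes away, provided it is unreferenced and clean.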
|
|
|
|
|
2016-07-29 14:10:57 +02:00
|
|
|
int filemap_check_errors(struct address_space *mapping)
|
2013-04-29 15:08:42 -07:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
/* Check for outstanding write errors */
|
2014-05-22 11:54:16 -07:00
|
|
|
if (test_bit(AS_ENOSPC, &mapping->flags) &&
|
|
|
|
test_and_clear_bit(AS_ENOSPC, &mapping->flags))
|
2013-04-29 15:08:42 -07:00
|
|
|
ret = -ENOSPC;
|
2014-05-22 11:54:16 -07:00
|
|
|
if (test_bit(AS_EIO, &mapping->flags) &&
|
|
|
|
test_and_clear_bit(AS_EIO, &mapping->flags))
|
2013-04-29 15:08:42 -07:00
|
|
|
ret = -EIO;
|
|
|
|
return ret;
|
|
|
|
}
|
2016-07-29 14:10:57 +02:00
|
|
|
EXPORT_SYMBOL(filemap_check_errors);
|
2013-04-29 15:08:42 -07:00
|
|
|
|
2017-07-06 07:02:22 -04:00
|
|
|
static int filemap_check_and_keep_errors(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
/* Check for outstanding write errors */
|
|
|
|
if (test_bit(AS_EIO, &mapping->flags))
|
|
|
|
return -EIO;
|
|
|
|
if (test_bit(AS_ENOSPC, &mapping->flags))
|
|
|
|
return -ENOSPC;
|
|
|
|
return 0;
|
|
|
|
}
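The two helpers above differ only in whether they consume the error bits. A minimal userspace sketch of that difference follows; the `MODEL_AS_*` constants, `struct model_mapping`, and the `model_*` functions are hypothetical stand-ins, and the bit operations here are not atomic, unlike the kernel's `test_and_clear_bit()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's AS_EIO / AS_ENOSPC flag bits. */
enum { MODEL_AS_EIO = 1u << 0, MODEL_AS_ENOSPC = 1u << 1 };

struct model_mapping {
	unsigned int flags;
};

/* Non-atomic model of test_and_clear_bit(): report the bit, then clear it. */
static bool model_test_and_clear(unsigned int *flags, unsigned int bit)
{
	bool was_set = (*flags & bit) != 0;

	*flags &= ~bit;
	return was_set;
}

/* Mirrors filemap_check_errors(): reports and consumes both error bits. */
int model_check_errors(struct model_mapping *m)
{
	int ret = 0;

	assert(m != NULL);
	if (model_test_and_clear(&m->flags, MODEL_AS_ENOSPC))
		ret = -ENOSPC;
	if (model_test_and_clear(&m->flags, MODEL_AS_EIO))
		ret = -EIO;	/* EIO wins when both bits were set */
	return ret;
}

/* Mirrors filemap_check_and_keep_errors(): reports without clearing. */
int model_check_and_keep_errors(const struct model_mapping *m)
{
	assert(m != NULL);
	if (m->flags & MODEL_AS_EIO)
		return -EIO;
	if (m->flags & MODEL_AS_ENOSPC)
		return -ENOSPC;
	return 0;
}
```

Calling the keeping variant any number of times leaves the flags intact, while the first call to the clearing variant returns the error and a second call returns 0.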
|
|
|
|
|
2021-07-14 14:47:22 -04:00
|
|
|
/**
|
|
|
|
* filemap_fdatawrite_wbc - start writeback on mapping dirty pages in range
|
|
|
|
* @mapping: address space structure to write
|
|
|
|
* @wbc: the writeback_control controlling the writeout
|
|
|
|
*
|
|
|
|
* Call writepages on the mapping using the provided wbc to control the
|
|
|
|
* writeout.
|
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
|
|
|
*/
|
|
|
|
int filemap_fdatawrite_wbc(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!mapping_can_writeback(mapping) ||
|
|
|
|
!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
wbc_attach_fdatawrite_inode(wbc, mapping->host);
|
|
|
|
ret = do_writepages(mapping, wbc);
|
|
|
|
wbc_detach_inode(wbc);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_fdatawrite_wbc);
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/**
|
2006-06-23 02:03:49 -07:00
|
|
|
* __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
|
2005-05-01 08:59:26 -07:00
|
|
|
* @mapping: address space structure to write
|
|
|
|
* @start: offset in bytes where the range starts
|
2006-03-24 03:17:45 -08:00
|
|
|
* @end: offset in bytes where the range ends (inclusive)
|
2005-05-01 08:59:26 -07:00
|
|
|
* @sync_mode: enable synchronous operation
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
2006-06-23 02:03:49 -07:00
|
|
|
* Start writeback against all of a mapping's dirty pages that lie
|
|
|
|
* within the byte offsets <start, end> inclusive.
|
|
|
|
*
|
2005-04-16 15:20:36 -07:00
|
|
|
* If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
|
2006-06-23 02:03:49 -07:00
|
|
|
* opposed to a regular memory cleansing writeback. The difference between
|
2005-04-16 15:20:36 -07:00
|
|
|
* these two operations is that if a dirty page/buffer is encountered, it must
|
|
|
|
* be waited upon, and not just skipped over.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
[PATCH] fadvise(): write commands
Add two new Linux-specific fadvise() extensions:
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.
LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.
By combining these two operations the application may do several things:
LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.
It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.
To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().
The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent the last affected byte in the file (ie: it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.
As Ulrich notes, these two functions are somewhat abusive of the fadvise()
concept, which appears to be "set the future policy for this fd".
But these commands are a perfect fit with the fadvise() implementation, and
several of the existing fadvise() commands are synchronous and don't affect
future policy either. I think we can live with the slight incongruity.
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 03:18:04 -08:00
|
|
|
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
|
|
|
|
loff_t end, int sync_mode)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = sync_mode,
|
mm: write_cache_pages integrity fix
In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
so the function will return success after writing out nr_to_write pages,
even if that was not sufficient to guarantee data integrity.
The callers tend to set it to values that could break data integrity
semantics easily in practice. For example, nr_to_write can be set to
mapping->nrpages * 2, however if a file has a single, dirty page, then
fsync is called, subsequent pages might be concurrently added and dirtied,
then write_cache_pages might writeout two of these newly dirty pages,
while not writing out the old page that should have been written out.
Fix this by ignoring nr_to_write if it is a data integrity sync.
This is a data integrity bug.
The reason this has been done in the past is to avoid stalling sync
operations behind page dirtiers.
"If a file has one dirty page at offset 1000000000000000 then someone
does an fsync() and someone else gets in first and starts madly writing
pages at offset 0, we want to write that page at 1000000000000000.
Somehow."
What we do today is return success after an arbitrary amount of pages are
written, whether or not we have provided the data-integrity semantics that
the caller has asked for. Even this doesn't actually fix all stall cases
completely: in the above situation, if the file has a huge number of pages
in pagecache (but not dirty), then mapping->nrpages is going to be huge,
even if pages are being dirtied.
This change does indeed make the possibility of long stalls larger, and
that's not a good thing, but lying about data integrity is even worse. We
have to either perform the sync, or return -ELINUXISLAME so at least the
caller knows what has happened.
There are subsequent competing approaches in the works to solve the stall
problems properly, without compromising data integrity.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06 14:39:08 -08:00
|
|
|
.nr_to_write = LONG_MAX,
|
[PATCH] writeback: fix range handling
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes the range fields of
writeback_control. A caller that invokes ->writepages() to write pages
now always sets a range (range_start/range_end or range_cyclic).
If range_cyclic is true, ->writepages() treats the range as cyclic;
otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() set range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It already seems a bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 02:03:26 -07:00
|
|
|
.range_start = start,
|
|
|
|
.range_end = end,
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
2021-07-14 14:47:22 -04:00
|
|
|
return filemap_fdatawrite_wbc(mapping, &wbc);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
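The `.nr_to_write = LONG_MAX` initializer above goes together with the rule from the "write_cache_pages integrity fix" commit message: a data-integrity sync must not stop early because a write budget ran out. The toy loop below sketches that rule in userspace. It is not the kernel's `write_cache_pages()`; `model_write_pages()` and the `MODEL_SYNC_*` names are hypothetical, and pages are reduced to an array of dirty flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_sync_mode { MODEL_SYNC_NONE, MODEL_SYNC_ALL };

/*
 * Toy writeback pass over per-page dirty flags. Returns the number of
 * pages written. Under MODEL_SYNC_NONE the pass stops once the
 * nr_to_write budget is used up; under MODEL_SYNC_ALL the budget is
 * ignored so that every dirty page in the array gets written.
 */
size_t model_write_pages(bool *dirty, size_t npages,
			 enum model_sync_mode mode, long nr_to_write)
{
	size_t written = 0;

	assert(dirty != NULL || npages == 0);

	for (size_t i = 0; i < npages; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;	/* "write" the page */
		written++;
		if (mode == MODEL_SYNC_NONE && --nr_to_write <= 0)
			break;
	}
	return written;
}
```

With three dirty pages and a budget of two, the non-integrity pass leaves the last dirty page behind, while the integrity pass cleans all three regardless of the budget it was given.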
|
|
|
|
|
|
|
|
static inline int __filemap_fdatawrite(struct address_space *mapping,
|
|
|
|
int sync_mode)
|
|
|
|
{
|
2006-06-23 02:03:26 -07:00
|
|
|
return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
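`__filemap_fdatawrite()` above asks for the whole file by passing the inclusive range 0..LLONG_MAX, which is the convention introduced by the "writeback: fix range handling" commit. The sketch below illustrates why that sentinel is easier to work with than overloading 0/0 or using -1. `struct model_wbc` and the `model_*` helpers are hypothetical and carry only the range fields.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down model of the range fields in writeback_control. */
struct model_wbc {
	long long range_start;
	long long range_end;	/* inclusive */
	bool range_cyclic;
};

/* A one-byte request at offset zero is distinguishable from "whole file". */
bool model_is_whole_file(const struct model_wbc *wbc)
{
	assert(wbc != NULL);
	return !wbc->range_cyclic &&
	       wbc->range_start == 0 && wbc->range_end == LLONG_MAX;
}

/* Number of bytes covered by an inclusive range that is not the whole file. */
long long model_range_len(const struct model_wbc *wbc)
{
	assert(wbc->range_end >= wbc->range_start);
	assert(!model_is_whole_file(wbc));	/* end - start + 1 would overflow */
	return wbc->range_end - wbc->range_start + 1;
}

/*
 * The pitfall named in the commit message: with -1 standing in for EOF,
 * arithmetic on range_end silently yields a small, valid-looking offset.
 */
long long model_minus_one_plus(long long val)
{
	long long range_end = -1;

	range_end += val;
	return range_end;	/* val - 1, no longer recognizable as "EOF" */
}
```

A request for the single byte at offset zero (`range_start = 0, range_end = 0`) reports a length of one and is not mistaken for a whole-file request, which is the ambiguity the commit set out to remove.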
|
|
|
|
|
|
|
|
int filemap_fdatawrite(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_fdatawrite);
|
|
|
|
|
2008-07-11 19:27:31 -04:00
|
|
|
int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
|
2006-03-24 03:18:04 -08:00
|
|
|
loff_t end)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
|
|
|
|
}
|
2008-07-11 19:27:31 -04:00
|
|
|
EXPORT_SYMBOL(filemap_fdatawrite_range);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2006-06-23 02:03:49 -07:00
|
|
|
/**
|
|
|
|
* filemap_flush - mostly a non-blocking flush
|
|
|
|
* @mapping: target address_space
|
|
|
|
*
|
2005-04-16 15:20:36 -07:00
|
|
|
* This is a mostly non-blocking flush. Not suitable for data-integrity
|
|
|
|
* purposes - I/O may not be started against all dirty pages.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: %0 on success, negative error code otherwise.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
|
|
|
int filemap_flush(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_flush);
|
|
|
|
|
2017-06-20 07:05:41 -05:00
|
|
|
/**
|
|
|
|
* filemap_range_has_page - check if a page exists in range.
|
|
|
|
* @mapping: address space within which to check
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Find at least one page in the range supplied, usually used to check if
|
|
|
|
* direct writing in this range will trigger a writeback.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: %true if at least one page exists in the specified range,
|
|
|
|
* %false otherwise.
|
2017-06-20 07:05:41 -05:00
|
|
|
*/
|
|
|
|
bool filemap_range_has_page(struct address_space *mapping,
|
|
|
|
loff_t start_byte, loff_t end_byte)
|
|
|
|
{
|
2023-01-16 19:39:40 +00:00
|
|
|
struct folio *folio;
|
2018-01-16 06:26:49 -05:00
|
|
|
XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
|
|
|
|
pgoff_t max = end_byte >> PAGE_SHIFT;
|
2017-06-20 07:05:41 -05:00
|
|
|
|
|
|
|
if (end_byte < start_byte)
|
|
|
|
return false;
|
|
|
|
|
2018-01-16 06:26:49 -05:00
|
|
|
rcu_read_lock();
|
|
|
|
for (;;) {
|
2023-01-16 19:39:40 +00:00
|
|
|
folio = xas_find(&xas, max);
|
|
|
|
if (xas_retry(&xas, folio))
|
2018-01-16 06:26:49 -05:00
|
|
|
continue;
|
|
|
|
/* Shadow entries don't count */
|
2023-01-16 19:39:40 +00:00
|
|
|
if (xa_is_value(folio))
|
2018-01-16 06:26:49 -05:00
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* We don't need to try to pin this page; we're about to
|
|
|
|
* release the RCU lock anyway. It is enough to know that
|
|
|
|
* there was a page here recently.
|
|
|
|
*/
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2017-06-20 07:05:41 -05:00
|
|
|
|
2023-01-16 19:39:40 +00:00
|
|
|
return folio != NULL;
|
2017-06-20 07:05:41 -05:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_range_has_page);
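The byte-to-index conversion used by `filemap_range_has_page()` (both offsets shifted right by `PAGE_SHIFT`, with an inclusive end) can be tried out in userspace. The sketch below assumes 4 KiB pages and replaces the XArray with a plain array of "present" flags; `MODEL_PAGE_SHIFT` and `model_range_has_page()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/*
 * Toy version of the range check: present[i] says whether page index i
 * is cached. Both byte offsets are inclusive, as in the kernel helper.
 */
bool model_range_has_page(const bool *present, size_t npages,
			  long long start_byte, long long end_byte)
{
	unsigned long long first, last;

	assert(present != NULL || npages == 0);

	if (start_byte < 0 || end_byte < start_byte)
		return false;

	first = (unsigned long long)start_byte >> MODEL_PAGE_SHIFT;
	last = (unsigned long long)end_byte >> MODEL_PAGE_SHIFT;

	for (unsigned long long i = first; i <= last && i < npages; i++)
		if (present[i])
			return true;
	return false;
}
```

Because the end offset is inclusive, the range 0..4095 touches only page index 0, while 0..4096 also touches page index 1.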
|
|
|
|
|
2017-07-06 07:02:24 -04:00
|
|
|
static void __filemap_fdatawait_range(struct address_space *mapping,
|
2015-11-05 18:47:23 -08:00
|
|
|
loff_t start_byte, loff_t end_byte)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
|
|
|
pgoff_t index = start_byte >> PAGE_SHIFT;
|
|
|
|
pgoff_t end = end_byte >> PAGE_SHIFT;
|
2023-01-04 13:14:28 -08:00
|
|
|
struct folio_batch fbatch;
|
|
|
|
unsigned nr_folios;
|
|
|
|
|
|
|
|
folio_batch_init(&fbatch);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2017-11-15 17:35:05 -08:00
|
|
|
while (index <= end) {
|
2005-04-16 15:20:36 -07:00
|
|
|
unsigned i;
|
|
|
|
|
2023-01-04 13:14:28 -08:00
|
|
|
nr_folios = filemap_get_folios_tag(mapping, &index, end,
|
|
|
|
PAGECACHE_TAG_WRITEBACK, &fbatch);
|
|
|
|
|
|
|
|
if (!nr_folios)
|
2017-11-15 17:35:05 -08:00
|
|
|
break;
|
|
|
|
|
2023-01-04 13:14:28 -08:00
|
|
|
for (i = 0; i < nr_folios; i++) {
|
|
|
|
struct folio *folio = fbatch.folios[i];
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-01-04 13:14:28 -08:00
|
|
|
folio_wait_writeback(folio);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2023-01-04 13:14:28 -08:00
|
|
|
folio_batch_release(&fbatch);
|
2005-04-16 15:20:36 -07:00
|
|
|
cond_resched();
|
|
|
|
}
|
2015-11-05 18:47:23 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* filemap_fdatawait_range - wait for writeback to complete
|
|
|
|
* @mapping: address space structure to wait for
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the given address space
|
|
|
|
* in the given range and wait for all of them. Check error status of
|
|
|
|
* the address space and return it.
|
|
|
|
*
|
|
|
|
* Since the error status of the address space is cleared by this function,
|
|
|
|
* callers are responsible for checking the return value and handling and/or
|
|
|
|
* reporting the error.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: error status of the address space.
|
2015-11-05 18:47:23 -08:00
|
|
|
*/
|
|
|
|
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
|
|
|
|
loff_t end_byte)
|
|
|
|
{
|
2017-07-06 07:02:24 -04:00
|
|
|
__filemap_fdatawait_range(mapping, start_byte, end_byte);
|
|
|
|
return filemap_check_errors(mapping);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2009-08-17 19:30:27 +02:00
|
|
|
EXPORT_SYMBOL(filemap_fdatawait_range);
|
|
|
|
|
2019-06-20 17:05:37 -04:00
|
|
|
/**
|
|
|
|
* filemap_fdatawait_range_keep_errors - wait for writeback to complete
|
|
|
|
* @mapping: address space structure to wait for
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the given address space in the
|
|
|
|
* given range and wait for all of them. Unlike filemap_fdatawait_range(),
|
|
|
|
* this function does not clear error status of the address space.
|
|
|
|
*
|
|
|
|
* Use this function if callers don't handle errors themselves. Expected
|
|
|
|
* call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
|
|
|
|
* fsfreeze(8)
|
|
|
|
*/
|
|
|
|
int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
|
|
|
|
loff_t start_byte, loff_t end_byte)
|
|
|
|
{
|
|
|
|
__filemap_fdatawait_range(mapping, start_byte, end_byte);
|
|
|
|
return filemap_check_and_keep_errors(mapping);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);
|
|
|
|
|
2017-07-28 07:24:43 -04:00
|
|
|
/**
|
|
|
|
* file_fdatawait_range - wait for writeback to complete
|
|
|
|
* @file: file pointing to address space structure to wait for
|
|
|
|
* @start_byte: offset in bytes where the range starts
|
|
|
|
* @end_byte: offset in bytes where the range ends (inclusive)
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the address space that file
|
|
|
|
* refers to, in the given range and wait for all of them. Check error
|
|
|
|
* status of the address space vs. the file->f_wb_err cursor and return it.
|
|
|
|
*
|
|
|
|
* Since the error status of the file is advanced by this function,
|
|
|
|
* callers are responsible for checking the return value and handling and/or
|
|
|
|
* reporting the error.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: error status of the address space vs. the file->f_wb_err cursor.
|
2017-07-28 07:24:43 -04:00
|
|
|
*/
|
|
|
|
int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
|
|
|
|
__filemap_fdatawait_range(mapping, start_byte, end_byte);
|
|
|
|
return file_check_and_advance_wb_err(file);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(file_fdatawait_range);
|
2009-08-17 19:30:27 +02:00
|
|
|
|
2015-11-05 18:47:23 -08:00
|
|
|
/**
|
|
|
|
* filemap_fdatawait_keep_errors - wait for writeback without clearing errors
|
|
|
|
* @mapping: address space structure to wait for
|
|
|
|
*
|
|
|
|
* Walk the list of under-writeback pages of the given address space
|
|
|
|
* and wait for all of them. Unlike filemap_fdatawait(), this function
|
|
|
|
* does not clear error status of the address space.
|
|
|
|
*
|
|
|
|
* Use this function if callers don't handle errors themselves. Expected
|
|
|
|
* call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
|
|
|
|
* fsfreeze(8)
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: error status of the address space.
|
2015-11-05 18:47:23 -08:00
|
|
|
*/
|
2017-07-06 07:02:22 -04:00
|
|
|
int filemap_fdatawait_keep_errors(struct address_space *mapping)
|
2015-11-05 18:47:23 -08:00
|
|
|
{
|
2017-07-31 10:29:38 -04:00
|
|
|
__filemap_fdatawait_range(mapping, 0, LLONG_MAX);
|
2017-07-06 07:02:22 -04:00
|
|
|
return filemap_check_and_keep_errors(mapping);
|
2015-11-05 18:47:23 -08:00
|
|
|
}
|
2017-07-06 07:02:22 -04:00
|
|
|
EXPORT_SYMBOL(filemap_fdatawait_keep_errors);
|
2015-11-05 18:47:23 -08:00
|
|
|
|
2019-09-23 15:34:48 -07:00
|
|
|
/* Returns true if writeback might be needed or already in progress. */
|
2017-07-26 10:21:11 -04:00
|
|
|
static bool mapping_needs_writeback(struct address_space *mapping)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2019-09-23 15:34:48 -07:00
|
|
|
return mapping->nrpages;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2021-10-28 08:47:05 -06:00
|
|
|
bool filemap_range_has_writeback(struct address_space *mapping,
|
|
|
|
loff_t start_byte, loff_t end_byte)
|
2021-11-05 13:37:13 -07:00
|
|
|
{
|
|
|
|
XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
|
|
|
|
pgoff_t max = end_byte >> PAGE_SHIFT;
|
2022-09-05 14:45:57 -07:00
|
|
|
struct folio *folio;
|
2021-11-05 13:37:13 -07:00
|
|
|
|
|
|
|
if (end_byte < start_byte)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2022-09-05 14:45:57 -07:00
|
|
|
xas_for_each(&xas, folio, max) {
|
|
|
|
if (xas_retry(&xas, folio))
|
2021-11-05 13:37:13 -07:00
|
|
|
continue;
|
2022-09-05 14:45:57 -07:00
|
|
|
if (xa_is_value(folio))
|
2021-11-05 13:37:13 -07:00
|
|
|
continue;
|
2022-09-05 14:45:57 -07:00
|
|
|
if (folio_test_dirty(folio) || folio_test_locked(folio) ||
|
|
|
|
folio_test_writeback(folio))
|
2021-11-05 13:37:13 -07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2022-09-05 14:45:57 -07:00
|
|
|
return folio != NULL;
|
mm: provide filemap_range_needs_writeback() helper
Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.
An internal workload complained because it was using too much CPU, and
when I took a look, we had a lot of io_uring workers going to town.
For an async buffered read like workload, I am normally expecting _zero_
offloads to a worker thread, but this one had tons of them. I'd drop
caches and things would look good again, but then a minute later we'd
regress back to using workers. Turns out that every minute something
was reading parts of the device, which would add page cache for that
inode. I put patches like these in for our kernel, and the problem was
solved.
Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
entries for the given range. This causes unnecessary work from the
caller's side, when the IO could have been issued totally fine without
blocking on writeback when there is none.
This patch (of 3):
For O_DIRECT reads/writes, we check if we need to issue a call to
filemap_write_and_wait_range() to issue and/or wait for writeback for any
page in the given range. The existing mechanism just checks for a page in
the range, which is suboptimal for IOCB_NOWAIT as we'll fall back to the
slow path (and needing retry) if there's just a clean page cache page in
the range.
Provide filemap_range_needs_writeback() which tries a little harder to
check if we actually need to issue and/or wait for writeback in the range.
Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-29 22:55:18 -07:00
}
2021-10-28 08:47:05 -06:00
EXPORT_SYMBOL_GPL(filemap_range_has_writeback);
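The loop in the function exported above can be modelled in userspace. The following is a minimal sketch, assuming a hypothetical `struct toy_folio` array in place of the page cache XArray; `toy_range_has_writeback` and its fields are illustrative names, not kernel API. It shows that a clean cached entry alone does not make the check return true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached folio; not a kernel type. */
struct toy_folio {
	bool present;	/* false models an empty slot or a value entry */
	bool dirty;
	bool locked;
	bool writeback;
};

/*
 * Return true if any present entry in [start, end] (inclusive indices)
 * is dirty, locked or under writeback. Clean cached entries are skipped.
 */
bool toy_range_has_writeback(const struct toy_folio *cache, size_t nr,
			     size_t start, size_t end)
{
	if (end < start)
		return false;
	for (size_t i = start; i <= end && i < nr; i++) {
		if (!cache[i].present)
			continue;
		if (cache[i].dirty || cache[i].locked || cache[i].writeback)
			return true;
	}
	return false;
}
```

With a cache holding one clean entry, one empty slot and one dirty entry, a range covering only the first two slots reports false, while a range that reaches the dirty entry reports true.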
2021-04-29 22:55:18 -07:00

2006-06-23 02:03:49 -07:00
/**
 * filemap_write_and_wait_range - write out & wait on a file range
 * @mapping:	the address_space for the pages
 * @lstart:	offset in bytes where the range starts
 * @lend:	offset in bytes where the range ends (inclusive)
 *
2006-03-24 03:17:45 -08:00
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
2017-03-30 17:11:36 -03:00
 * Note that @lend is inclusive (describes the last byte to be written) so
2006-03-24 03:17:45 -08:00
 * that this function can be used to write to the very end-of-file (end = -1).
2019-03-05 15:48:42 -08:00
 *
 * Return: error status of the address space.
2006-03-24 03:17:45 -08:00
 */
2005-04-16 15:20:36 -07:00
int filemap_write_and_wait_range(struct address_space *mapping,
				 loff_t lstart, loff_t lend)
{
2022-06-27 21:23:51 +08:00
	int err = 0, err2;
2005-04-16 15:20:36 -07:00

filemap: skip write and wait if end offset precedes start
Patch series "filemap: skip write and wait if end offset precedes start",
v2.
A fix for the odd write and wait behavior described in the patch 1 commit
log. Technically patch 1 could simply remove the check rather than lift
it into the callers, but this seemed a bit more user friendly to me.
Patch 2 is appended after observation that fadvise() interacted poorly
with the v1 patch. This is no longer a problem with v2, making patch 2
purely a cleanup.
This series survived both fstests and ltp regression runs without
observable problems. I had (end < start) warning checks in each relevant
function, with fadvise() being the only caller that triggered them. That
said, I dropped the warnings after testing because there seemed too much
potential for noise from the various other callers.
This patch (of 2):
A call to file[map]_write_and_wait_range() with an end offset that
precedes the start offset but happens to land in the same page can trigger
writeback submission but fails to wait on the submitted page. Writeback
submission occurs because __filemap_fdatawrite_range() passes both offsets
down into write_cache_pages(), which rounds down to page indexes before it
starts processing writeback. However, __filemap_fdatawait_range()
immediately returns if the byte-granular end offset precedes the start
offset.
This behavior was observed in the form of unpredictable latency from a
frequent write and wait call with incorrect parameters. The behavior gave
the impression that the fdatawait path might occasionally fail to wait on
writeback, but further investigation showed the latency was from
write_cache_pages() waiting on writeback state to clear for a page already
under writeback. Therefore, this indicated that fdatawait actually never
waits on writeback in this particular situation.
The byte granular check in __filemap_fdatawait_range() goes all the way
back to the old wait_on_page_writeback() helper. It originally used page
offsets and so would have waited in this problematic case. That changed
to byte granularity file offsets in commit 94004ed726f3 ("kill
wait_on_page_writeback_range"), which subtly changed this behavior. The
check itself has become somewhat redundant since the error checking code
that used to follow the wait loop (at the time of the aforementioned
commit) has now been removed and lifted into the higher level callers.
Therefore, we can restore historical fdatawait behavior by simply removing
the check. Since the current fdatawait behavior has been in place for
quite some time and is consistent with other interfaces that use file
offsets, instead lift the check into the file[map]_write_and_wait_range()
helpers to provide consistent behavior between the write and wait.
Link: https://lkml.kernel.org/r/20221128155632.3950447-1-bfoster@redhat.com
Link: https://lkml.kernel.org/r/20221128155632.3950447-2-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-28 10:56:31 -05:00
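The mismatch described in this commit message can be reproduced with plain arithmetic. This is a minimal sketch, assuming 4 KiB pages; `TOY_PAGE_SHIFT` and the `toy_*` helpers are hypothetical names that only model the two comparisons, not the kernel functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Page index containing a byte offset (rounds down). */
uint64_t toy_page_index(int64_t off)
{
	return (uint64_t)off >> TOY_PAGE_SHIFT;
}

/* Index-granular view: does write submission see a non-empty range? */
bool toy_write_sees_range(int64_t lstart, int64_t lend)
{
	return toy_page_index(lstart) <= toy_page_index(lend);
}

/* Byte-granular view: the wait side bails out when end precedes start. */
bool toy_wait_sees_range(int64_t lstart, int64_t lend)
{
	return lend >= lstart;
}
```

For lstart = 100 and lend = 50, both offsets round down to page index 0, so the index-granular view sees a range while the byte-granular view does not. Checking `lend < lstart` once in the caller makes both steps agree.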
	if (lend < lstart)
		return 0;

2017-07-26 10:21:11 -04:00
	if (mapping_needs_writeback(mapping)) {
[PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait)
This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.
See mm/filemap.c:
It also changes filemap_write_and_wait() and filemap_write_and_wait_range().
The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns an error. However, even if filemap_fdatawrite() returned an
error, it may have submitted part of the data pages to the device
(e.g. in the case of -ENOSPC).
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 01:02:14 -08:00
		err = __filemap_fdatawrite_range(mapping, lstart, lend,
						 WB_SYNC_ALL);
2020-01-30 22:12:07 -08:00
		/*
		 * Even if the above returned an error, the pages may have
		 * been partially written (e.g. -ENOSPC), so we wait for them.
		 * But -EIO is a special case: it may indicate that the worst
		 * thing (e.g. a bug) happened, so we avoid waiting for it.
		 */
2022-06-27 21:23:51 +08:00
		if (err != -EIO)
			__filemap_fdatawait_range(mapping, lstart, lend);
2005-04-16 15:20:36 -07:00
	}
2022-06-27 21:23:51 +08:00
	err2 = filemap_check_errors(mapping);
	if (!err)
		err = err2;
2006-01-08 01:02:14 -08:00
	return err;
2005-04-16 15:20:36 -07:00
}
2009-04-15 13:22:37 -04:00
EXPORT_SYMBOL(filemap_write_and_wait_range);
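The error handling in the function above follows a small decision table. This is a minimal sketch under stated assumptions: `struct toy_result` and `toy_write_and_wait` are hypothetical, the write and error-check steps are replaced by preset values, and the mapping is assumed to have pages needing writeback.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical preset outcomes; the kernel calls real helpers instead. */
struct toy_result {
	int write_err;	/* result of the write submission step */
	int stored_err;	/* error recorded on the mapping, 0 if none */
	int waited;	/* set to 1 by toy_write_and_wait() if it waited */
};

int toy_write_and_wait(struct toy_result *r, long long lstart, long long lend)
{
	int err;

	r->waited = 0;
	if (lend < lstart)
		return 0;		/* empty range: nothing to do */

	err = r->write_err;
	if (err != -EIO)
		r->waited = 1;		/* wait even after e.g. -ENOSPC */

	if (!err)
		err = r->stored_err;	/* a submission error takes precedence */
	return err;
}
```

The sketch waits after -ENOSPC but not after -EIO, and returns the stored mapping error only when submission itself succeeded.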
2005-04-16 15:20:36 -07:00

fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 07:02:25 -04:00
void __filemap_set_wb_err(struct address_space *mapping, int err)
{
2017-07-24 06:22:15 -04:00
	errseq_t eseq = errseq_set(&mapping->wb_err, err);
2017-07-06 07:02:25 -04:00

	trace_filemap_set_wb_err(mapping, eseq);
}
EXPORT_SYMBOL(__filemap_set_wb_err);

/**
 * file_check_and_advance_wb_err - report wb error (if any) that was previously
 * 				   recorded and advance wb_err to current one
 * @file: struct file on which the error is being reported
 *
 * When userland calls fsync (or something like nfsd does the equivalent), we
 * want to report any writeback errors that occurred since the last fsync (or
 * since the file was opened if there haven't been any).
 *
 * Grab the wb_err from the mapping. If it matches what we have in the file,
 * then just quickly return 0. The file is all caught up.
 *
 * If it doesn't match, then take the mapping value, set the "seen" flag in
 * it and try to swap it into place. If it works, or another task beat us
 * to it with the new value, then update the f_wb_err and return the error
 * portion. The error at this point must be reported via proper channels
 * (a la fsync, or NFS COMMIT operation, etc.).
 *
 * While we handle mapping->wb_err with atomic operations, the f_wb_err
 * value is protected by the f_lock since we must ensure that it reflects
 * the latest value swapped in for this file descriptor.
2019-03-05 15:48:42 -08:00
 *
 * Return: %0 on success, negative error code otherwise.
2017-07-06 07:02:25 -04:00
 */
int file_check_and_advance_wb_err(struct file *file)
{
	int err = 0;
	errseq_t old = READ_ONCE(file->f_wb_err);
	struct address_space *mapping = file->f_mapping;

	/* Locklessly handle the common case where nothing has changed */
	if (errseq_check(&mapping->wb_err, old)) {
		/* Something changed, must use slow path */
		spin_lock(&file->f_lock);
		old = file->f_wb_err;
		err = errseq_check_and_advance(&mapping->wb_err,
						&file->f_wb_err);
		trace_file_check_and_advance_wb_err(file, old);
		spin_unlock(&file->f_lock);
	}
2017-10-03 16:15:25 -07:00

	/*
	 * We're mostly using this function as a drop in replacement for
	 * filemap_check_errors. Clear AS_EIO/AS_ENOSPC to emulate the effect
	 * that the legacy code would have had on these flags.
	 */
	clear_bit(AS_EIO, &mapping->flags);
	clear_bit(AS_ENOSPC, &mapping->flags);
2017-07-06 07:02:25 -04:00
	return err;
}
EXPORT_SYMBOL(file_check_and_advance_wb_err);
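The per-file "since" cursor used by the function above can be illustrated with a toy model. This is a simplified sketch, not the kernel's errseq_t encoding: `struct toy_errseq` keeps a plain counter and error code, has no "seen" flag and uses no atomics or locking, and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Toy error sequence: NOT the kernel's errseq_t encoding. */
struct toy_errseq {
	unsigned int seq;	/* bumped every time an error is recorded */
	int err;		/* most recent error, as a negative errno */
};

/* Record a writeback error on the (toy) mapping. */
void toy_set_wb_err(struct toy_errseq *mapping, int err)
{
	mapping->err = err;
	mapping->seq++;
}

/* Sample the current state, as a file would when it is opened. */
struct toy_errseq toy_sample(const struct toy_errseq *mapping)
{
	return *mapping;
}

/*
 * Report an error recorded since this file's cursor was last advanced,
 * then advance the cursor. Each file owns its cursor, so every opener
 * sees a given error once.
 */
int toy_check_and_advance(const struct toy_errseq *mapping,
			  struct toy_errseq *cursor)
{
	if (cursor->seq == mapping->seq)
		return 0;	/* fast path: nothing new */
	*cursor = *mapping;
	return mapping->err;
}
```

Two cursors sampled before an error is recorded both report it on their next check, and a repeated check on the same cursor returns 0. Under the older flag-based scheme, only the first caller would have seen the error.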

/**
 * file_write_and_wait_range - write out & wait on a file range
 * @file:	file pointing to address_space with pages
 * @lstart:	offset in bytes where the range starts
 * @lend:	offset in bytes where the range ends (inclusive)
 *
 * Write out and wait upon file offsets lstart->lend, inclusive.
 *
 * Note that @lend is inclusive (describes the last byte to be written) so
 * that this function can be used to write to the very end-of-file (end = -1).
 *
 * After writing out and waiting on the data, we check and advance the
 * f_wb_err cursor to the latest value, and return any errors detected there.
2019-03-05 15:48:42 -08:00
 *
 * Return: %0 on success, negative error code otherwise.
2017-07-06 07:02:25 -04:00
 */
int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)
{
	int err = 0, err2;
	struct address_space *mapping = file->f_mapping;

filemap: skip write and wait if end offset precedes start
Patch series "filemap: skip write and wait if end offset precedes start",
v2.
A fix for the odd write and wait behavior described in the patch 1 commit
log. Technically patch 1 could simply remove the check rather than lift
it into the callers, but this seemed a bit more user friendly to me.
Patch 2 is appended after observation that fadvise() interacted poorly
with the v1 patch. This is no longer a problem with v2, making patch 2
purely a cleanup.
This series survived both fstests and ltp regression runs without
observable problems. I had (end < start) warning checks in each relevant
function, with fadvise() being the only caller that triggered them. That
said, I dropped the warnings after testing because there seemed to much
potential for noise from the various other callers.
This patch (of 2):
A call to file[map]_write_and_wait_range() with an end offset that
precedes the start offset but happens to land in the same page can trigger
writeback submission but fails to wait on the submitted page. Writeback
submission occurs because __filemap_fdatawrite_range() passes both offsets
down into write_cache_pages(), which rounds down to page indexes before it
starts processing writeback. However, __filemap_fdatawait_range()
immediately returns if the byte-granular end offset precedes the start
offset.
This behavior was observed in the form of unpredictable latency from a
frequent write and wait call with incorrect parameters. The behavior gave
the impression that the fdatawait path might occasionally fail to wait on
writeback, but further investigation showed the latency was from
write_cache_pages() waiting on writeback state to clear for a page already
under writeback. Therefore, this indicated that fdatawait actually never
waits on writeback in this particular situation.
The byte granular check in __filemap_fdatawait_range() goes all the way
back to the old wait_on_page_writeback() helper. It originally used page
offsets and so would have waited in this problematic case. That changed
to byte granularity file offsets in commit 94004ed726f3 ("kill
wait_on_page_writeback_range"), which subtly changed this behavior. The
check itself has become somewhat redundant since the error checking code
that used to follow the wait loop (at the time of the aforementioned
commit) has now been removed and lifted into the higher level callers.
Therefore, we can restore historical fdatawait behavior by simply removing
the check. Since the current fdatawait behavior has been in place for
quite some time and is consistent with other interfaces that use file
offsets, instead lift the check into the file[map]_write_and_wait_range()
helpers to provide consistent behavior between the write and wait.
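The mismatch between the two paths can be illustrated with a small userspace sketch. It assumes 4 KiB pages, and the `sketch_*` helpers are hypothetical names that only mimic the page-index rounding on the writeback side and the byte-granular comparison on the wait side; they are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper: index of the page that contains a byte offset. */
static uint64_t sketch_page_index(int64_t byte_off)
{
	return (uint64_t)byte_off >> SKETCH_PAGE_SHIFT;
}

/*
 * Writeback side: the byte offsets are rounded down to page indexes,
 * so the range still covers a page when both offsets share one page.
 */
static bool sketch_write_range_nonempty(int64_t lstart, int64_t lend)
{
	return sketch_page_index(lstart) <= sketch_page_index(lend);
}

/* Wait side: the comparison is made on the raw byte offsets. */
static bool sketch_wait_range_nonempty(int64_t lstart, int64_t lend)
{
	return lend >= lstart;
}
```

With `lstart = 4196` and `lend = 4100`, both offsets fall in page 1, so the write-side check sees a non-empty range while the wait-side check sees an empty one.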
Link: https://lkml.kernel.org/r/20221128155632.3950447-1-bfoster@redhat.com
Link: https://lkml.kernel.org/r/20221128155632.3950447-2-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-28 10:56:31 -05:00
|
|
|
if (lend < lstart)
|
|
|
|
return 0;
|
|
|
|
|
2017-07-26 10:21:11 -04:00
|
|
|
if (mapping_needs_writeback(mapping)) {
|
fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.
The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.
If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.
This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.
In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.
One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.
This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.
This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_errors infrastructure.
This will at least prevent you from getting a false report of success.
The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
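The "since" cursor idea can be sketched in userspace as follows. This is a deliberately simplified model, not the kernel implementation: the real errseq_t packs the error code, a "seen" flag and a counter into a single 32-bit value, whereas the hypothetical `sketch_*` types below keep the error and the sequence number in separate fields.

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-in for errseq_t (assumption: separate fields). */
struct sketch_errseq {
	int err;          /* most recent writeback error, negative errno */
	unsigned int seq; /* bumped each time an error is recorded */
};

struct sketch_mapping {
	struct sketch_errseq wb_err;
};

struct sketch_file {
	struct sketch_mapping *mapping;
	unsigned int since; /* per-open-file cursor into wb_err.seq */
};

/* Sample the current sequence at open time, so older errors are ignored. */
static void sketch_open(struct sketch_file *f, struct sketch_mapping *m)
{
	f->mapping = m;
	f->since = m->wb_err.seq;
}

/* Record a writeback error on the mapping. */
static void sketch_set_error(struct sketch_mapping *m, int err)
{
	m->wb_err.err = err;
	m->wb_err.seq++;
}

/* Report an error once per file if one was recorded since the cursor. */
static int sketch_check_and_advance(struct sketch_file *f)
{
	struct sketch_errseq *e = &f->mapping->wb_err;

	if (f->since == e->seq)
		return 0;
	f->since = e->seq;
	return e->err;
}
```

In this model, two files opened on the same mapping each see a recorded -EIO exactly once, which is the property the message describes: the first fsync caller no longer consumes the error on behalf of everyone else.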
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-06 07:02:25 -04:00
|
|
|
err = __filemap_fdatawrite_range(mapping, lstart, lend,
|
|
|
|
WB_SYNC_ALL);
|
|
|
|
/* See comment of filemap_write_and_wait() */
|
|
|
|
if (err != -EIO)
|
|
|
|
__filemap_fdatawait_range(mapping, lstart, lend);
|
|
|
|
}
|
|
|
|
err2 = file_check_and_advance_wb_err(file);
|
|
|
|
if (!err)
|
|
|
|
err = err2;
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(file_write_and_wait_range);
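The control flow of file_write_and_wait_range() shown above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: the writeback result and the per-file error are passed in as plain integers (`write_err`, `file_err`), the mapping_needs_writeback() test is left out, and `waited` is a hypothetical out-parameter that only records whether the wait step would have run.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of the error handling only.
 * write_err: what the writeback submission step would return.
 * file_err:  what the per-file error check would return.
 * waited:    set to 1 if the wait step would have been executed.
 */
static int sketch_write_and_wait_range(long long lstart, long long lend,
				       int write_err, int file_err,
				       int *waited)
{
	int err, err2;

	*waited = 0;

	/* Inverted range: neither write nor wait, and no error check. */
	if (lend < lstart)
		return 0;

	err = write_err;
	/* After -EIO from submission, the wait is skipped. */
	if (err != -EIO)
		*waited = 1;

	/* The submission error takes precedence over the per-file error. */
	err2 = file_err;
	if (!err)
		err = err2;
	return err;
}
```

Note that in this model, as in the code above, an inverted range returns 0 before the per-file error is examined.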
|
|
|
|
|
2011-03-22 16:30:52 -07:00
|
|
|
/**
|
2022-11-01 10:53:22 -07:00
|
|
|
* replace_page_cache_folio - replace a pagecache folio with a new one
|
|
|
|
* @old: folio to be replaced
|
|
|
|
* @new: folio to replace with
|
|
|
|
*
|
|
|
|
* This function replaces a folio in the pagecache with a new one. On
|
|
|
|
* success it acquires the pagecache reference for the new folio and
|
|
|
|
* drops it for the old folio. Both the old and new folios must be
|
|
|
|
* locked. This function does not add the new folio to the LRU, the
|
2011-03-22 16:30:52 -07:00
|
|
|
* caller must do that.
|
|
|
|
*
|
2017-11-17 10:01:45 -05:00
|
|
|
* The remove + add is atomic. This function cannot fail.
|
2011-03-22 16:30:52 -07:00
|
|
|
*/
|
2022-11-01 10:53:22 -07:00
|
|
|
void replace_page_cache_folio(struct folio *old, struct folio *new)
|
2011-03-22 16:30:52 -07:00
|
|
|
{
|
2017-11-17 10:01:45 -05:00
|
|
|
struct address_space *mapping = old->mapping;
|
2022-05-01 07:35:31 -04:00
|
|
|
void (*free_folio)(struct folio *) = mapping->a_ops->free_folio;
|
2017-11-17 10:01:45 -05:00
|
|
|
pgoff_t offset = old->index;
|
|
|
|
XA_STATE(xas, &mapping->i_pages, offset);
|
2011-03-22 16:30:52 -07:00
|
|
|
|
2022-11-01 10:53:22 -07:00
|
|
|
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
|
|
|
|
VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
|
|
|
|
VM_BUG_ON_FOLIO(new->mapping, new);
|
2011-03-22 16:30:52 -07:00
|
|
|
|
2022-11-01 10:53:22 -07:00
|
|
|
folio_get(new);
|
2017-11-17 10:01:45 -05:00
|
|
|
new->mapping = mapping;
|
|
|
|
new->index = offset;
|
2011-03-22 16:30:52 -07:00
|
|
|
|
2023-10-06 11:46:27 -07:00
|
|
|
mem_cgroup_replace_folio(old, new);
|
2020-06-03 16:01:54 -07:00
|
|
|
|
2021-09-02 14:53:18 -07:00
|
|
|
xas_lock_irq(&xas);
|
2017-11-17 10:01:45 -05:00
|
|
|
xas_store(&xas, new);
|
2015-06-24 16:57:24 -07:00
|
|
|
|
2017-11-17 10:01:45 -05:00
|
|
|
old->mapping = NULL;
|
|
|
|
/* hugetlb pages do not participate in page cache accounting. */
|
2022-11-01 10:53:22 -07:00
|
|
|
if (!folio_test_hugetlb(old))
|
|
|
|
__lruvec_stat_sub_folio(old, NR_FILE_PAGES);
|
|
|
|
if (!folio_test_hugetlb(new))
|
|
|
|
__lruvec_stat_add_folio(new, NR_FILE_PAGES);
|
|
|
|
if (folio_test_swapbacked(old))
|
|
|
|
__lruvec_stat_sub_folio(old, NR_SHMEM);
|
|
|
|
if (folio_test_swapbacked(new))
|
|
|
|
__lruvec_stat_add_folio(new, NR_SHMEM);
|
2021-09-02 14:53:18 -07:00
|
|
|
xas_unlock_irq(&xas);
|
2022-05-01 07:35:31 -04:00
|
|
|
if (free_folio)
|
2022-11-01 10:53:22 -07:00
|
|
|
free_folio(old);
|
|
|
|
folio_put(old);
|
2011-03-22 16:30:52 -07:00
|
|
|
}
|
2022-11-01 10:53:22 -07:00
|
|
|
EXPORT_SYMBOL_GPL(replace_page_cache_folio);
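The reference and slot handling of replace_page_cache_folio() can be sketched with a toy userspace model. Everything here is an assumption made for illustration: the page cache is a small fixed array instead of an XArray, and locking, memcg handling and statistics are omitted. The `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SLOTS 8

struct sketch_mapping;

struct sketch_folio {
	struct sketch_mapping *mapping;
	size_t index;
	int refcount;
};

/* Toy page cache: a fixed array stands in for mapping->i_pages. */
struct sketch_mapping {
	struct sketch_folio *slots[SKETCH_SLOTS];
};

static void sketch_replace(struct sketch_folio *old_folio,
			   struct sketch_folio *new_folio)
{
	struct sketch_mapping *mapping = old_folio->mapping;
	size_t index = old_folio->index;

	/* The cache takes its own reference on the new folio. */
	new_folio->refcount++;
	new_folio->mapping = mapping;
	new_folio->index = index;

	/* One store swaps the entry, so the slot is never seen empty. */
	mapping->slots[index] = new_folio;
	old_folio->mapping = NULL;

	/* The cache's reference on the old folio is dropped last. */
	old_folio->refcount--;
}
```

After the call, the slot points at the new folio, the new folio holds one extra reference, and the old folio has lost both its mapping and the cache's reference.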
|
2011-03-22 16:30:52 -07:00
|
|
|
|
2020-12-08 08:56:28 -05:00
|
|
|
noinline int __filemap_add_folio(struct address_space *mapping,
|
|
|
|
struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2020-12-08 08:56:28 -05:00
|
|
|
XA_STATE(xas, &mapping->i_pages, index);
|
mm/filemap: optimize filemap folio adding
Instead of doing multiple tree walks, do one optimistic range check with
the lock held, and exit if raced with another insertion. If a shadow exists,
check it with a new xas_get_order helper before releasing the lock to
avoid redundant tree walks for getting its order.
Drop the lock and do the allocation only if a split is needed.
In the best case, it only needs to walk the tree once. If it needs to
allocate and split, 3 walks are issued (one for the first ranged conflict
check and order retrieval, one for the second check after allocation, and
one for the insert after the split).
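The retry scheme described above (check under the lock, back out to allocate, then revalidate) can be sketched as a single-threaded userspace model. All names are hypothetical, the "lock" is just a flag guarded by assertions, and the model counts locked passes rather than tree walks, so the extra walk for the insert after the split is not represented.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_slot {
	int locked;        /* stand-in for the xarray lock */
	int shadow_order;  /* order of an existing shadow entry, 0 if small */
	int locked_passes; /* how many times the lock was taken */
	int stored;        /* set once the new entry is in place */
};

static void sketch_lock(struct sketch_slot *s)
{
	assert(!s->locked);
	s->locked = 1;
}

static void sketch_unlock(struct sketch_slot *s)
{
	assert(s->locked);
	s->locked = 0;
}

/* Returns 0 on success, -1 if the out-of-lock allocation failed. */
static int sketch_add(struct sketch_slot *slot, int folio_order)
{
	void *prealloc = NULL;
	int alloced_order = 0;

	for (;;) {
		int split_order = 0;
		int order;

		sketch_lock(slot);
		slot->locked_passes++;
		order = slot->shadow_order;

		/* The entry may have changed while the lock was dropped. */
		if (alloced_order && order != alloced_order) {
			free(prealloc);
			prealloc = NULL;
			alloced_order = 0;
		}

		if (order > folio_order) {
			if (!alloced_order)
				split_order = order; /* back out to allocate */
			else
				slot->shadow_order = folio_order; /* "split" */
		}

		if (!split_order) {
			slot->shadow_order = 0;
			slot->stored = 1;
		}
		sketch_unlock(slot);

		if (!split_order)
			break;

		/* Allocation happens only with the lock dropped. */
		assert(!slot->locked);
		prealloc = malloc(64);
		if (!prealloc)
			return -1;
		alloced_order = split_order;
	}

	free(prealloc);
	return 0;
}
```

In this model the common case finishes in one locked pass, and only the case that needs a split pays for a second pass plus an allocation.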
Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap --rw=randread --time_based \
--ramp_time=30s --runtime=5m --group_reporting
Before:
bw ( MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
After (+7.3%):
bw ( MiB/s): min= 493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
Test results with THP (do a THP randread, then switch to 4K pages in the
hope that this triggers a lot of splitting):
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap -thp=1 --readonly \
--rw=randread --time_based --ramp_time=30s --runtime=10m \
--group_reporting
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap \
--rw=randread --time_based --runtime=5s --group_reporting
Before:
bw ( KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976
READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
After (+12.5%):
bw ( KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
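The quoted gains can be rechecked from the average bandwidth figures in the fio output above. The helper below is a hypothetical one-liner written for this check, not part of the patch.

```c
#include <assert.h>

/* Relative improvement of `after` over `before`, in percent. */
static double sketch_gain_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Calling it with the 4K averages (2445.02 and 2625.56 MiB/s) gives roughly 7.4%, which sits between the +7.3% and +7.5% quoted above, and with the THP averages (7935.51 and 8928.74 KiB/s) it gives roughly 12.5%.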
Link: https://lkml.kernel.org/r/20240415171857.19244-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16 01:18:56 +08:00
|
|
|
void *alloced_shadow = NULL;
|
|
|
|
int alloced_order = 0;
|
|
|
|
bool huge;
|
2024-04-16 01:18:54 +08:00
|
|
|
long nr;
|
2008-07-25 19:45:30 -07:00
|
|
|
|
2020-12-08 08:56:28 -05:00
|
|
|
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
|
|
|
|
VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
|
2024-08-22 15:50:10 +02:00
|
|
|
VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
|
|
|
|
folio);
|
2017-11-17 10:01:45 -05:00
|
|
|
mapping_set_update(&xas, mapping);
|
2008-07-25 19:45:30 -07:00
|
|
|
|
2023-09-26 12:20:17 -07:00
|
|
|
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
|
|
|
|
xas_set_order(&xas, index, folio_order(folio));
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
huge = folio_test_hugetlb(folio);
|
2023-09-26 12:20:17 -07:00
|
|
|
nr = folio_nr_pages(folio);
|
|
|
|
|
2020-10-15 20:05:20 -07:00
|
|
|
gfp &= GFP_RECLAIM_MASK;
|
2019-09-05 14:03:12 -04:00
|
|
|
folio_ref_add(folio, nr);
|
|
|
|
folio->mapping = mapping;
|
|
|
|
folio->index = xas.xa_index;
|
2020-10-15 20:05:20 -07:00
|
|
|
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
for (;;) {
|
|
|
|
int order = -1, split_order = 0;
|
2020-10-15 20:05:20 -07:00
|
|
|
void *entry, *old = NULL;
|
|
|
|
|
2017-11-17 10:01:45 -05:00
|
|
|
xas_lock_irq(&xas);
|
2020-10-15 20:05:20 -07:00
|
|
|
xas_for_each_conflict(&xas, entry) {
|
|
|
|
old = entry;
|
|
|
|
if (!xa_is_value(entry)) {
|
|
|
|
xas_set_err(&xas, -EEXIST);
|
|
|
|
goto unlock;
|
|
|
|
}
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
/*
|
|
|
|
* If a larger entry exists,
|
|
|
|
* it will be the first and only entry iterated.
|
|
|
|
*/
|
|
|
|
if (order == -1)
|
|
|
|
order = xas_get_order(&xas);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* entry may have changed before we re-acquire the lock */
|
|
|
|
if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
|
|
|
|
xas_destroy(&xas);
|
|
|
|
alloced_order = 0;
|
2020-10-15 20:05:20 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (old) {
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
if (order > 0 && order > folio_order(folio)) {
|
2019-09-05 14:03:12 -04:00
|
|
|
/* How to handle large swap entries? */
|
|
|
|
BUG_ON(shmem_mapping(mapping));
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
if (!alloced_order) {
|
|
|
|
split_order = order;
|
|
|
|
goto unlock;
|
|
|
|
}
|
2020-10-15 20:05:20 -07:00
|
|
|
xas_split(&xas, old, order);
|
|
|
|
xas_reset(&xas);
|
|
|
|
}
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
if (shadowp)
|
|
|
|
*shadowp = old;
|
2020-10-15 20:05:20 -07:00
|
|
|
}
|
|
|
|
|
2020-12-08 08:56:28 -05:00
|
|
|
xas_store(&xas, folio);
|
2017-11-17 10:01:45 -05:00
|
|
|
if (xas_error(&xas))
|
|
|
|
goto unlock;
|
|
|
|
|
2019-09-05 14:03:12 -04:00
|
|
|
mapping->nrpages += nr;
|
2017-11-17 10:01:45 -05:00
|
|
|
|
|
|
|
/* hugetlb pages do not participate in page cache accounting */
|
2019-09-05 14:03:12 -04:00
|
|
|
if (!huge) {
|
|
|
|
__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
|
|
|
|
if (folio_test_pmd_mappable(folio))
|
|
|
|
__lruvec_stat_mod_folio(folio,
|
|
|
|
NR_FILE_THPS, nr);
|
|
|
|
}
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
|
2017-11-17 10:01:45 -05:00
|
|
|
unlock:
|
|
|
|
xas_unlock_irq(&xas);
|
mm/filemap: optimize filemap folio adding
2024-04-16 01:18:56 +08:00
|
|
|
|
|
|
|
/* split needed, alloc here and retry. */
|
|
|
|
if (split_order) {
|
|
|
|
xas_split_alloc(&xas, old, split_order, gfp);
|
|
|
|
if (xas_error(&xas))
|
|
|
|
goto error;
|
|
|
|
alloced_shadow = old;
|
|
|
|
alloced_order = split_order;
|
|
|
|
xas_reset(&xas);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!xas_nomem(&xas, gfp))
|
|
|
|
break;
|
|
|
|
}
|
2017-11-17 10:01:45 -05:00
|
|
|
|
2019-09-05 14:03:12 -04:00
|
|
|
if (xas_error(&xas))
|
2017-11-17 10:01:45 -05:00
|
|
|
goto error;
|
2015-06-24 16:57:24 -07:00
|
|
|
|
2021-07-23 09:29:46 -04:00
|
|
|
trace_mm_filemap_add_to_page_cache(folio);
|
2013-09-12 15:13:59 -07:00
|
|
|
return 0;
|
2017-11-17 10:01:45 -05:00
|
|
|
error:
|
2020-12-08 08:56:28 -05:00
|
|
|
folio->mapping = NULL;
|
2013-09-12 15:13:59 -07:00
|
|
|
/* Leave folio->index set: truncation relies upon it */
|
2019-09-05 14:03:12 -04:00
|
|
|
folio_put_refs(folio, nr);
|
|
|
|
return xas_error(&xas);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2020-12-08 08:56:28 -05:00
|
|
|
ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
|
2014-04-03 14:47:51 -07:00
|
|
|
|
2020-12-08 08:56:28 -05:00
|
|
|
int filemap_add_folio(struct address_space *mapping, struct folio *folio,
|
|
|
|
pgoff_t index, gfp_t gfp)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2014-04-03 14:47:51 -07:00
|
|
|
void *shadow = NULL;
|
2008-10-18 20:26:32 -07:00
|
|
|
int ret;
|
|
|
|
|
2024-04-16 01:18:54 +08:00
|
|
|
ret = mem_cgroup_charge(folio, NULL, gfp);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2020-12-08 08:56:28 -05:00
|
|
|
__folio_set_locked(folio);
|
|
|
|
ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
|
2024-04-16 01:18:54 +08:00
|
|
|
if (unlikely(ret)) {
|
|
|
|
mem_cgroup_uncharge(folio);
|
2020-12-08 08:56:28 -05:00
|
|
|
__folio_clear_locked(folio);
|
2024-04-16 01:18:54 +08:00
|
|
|
} else {
|
2014-04-03 14:47:51 -07:00
|
|
|
/*
|
2020-12-08 08:56:28 -05:00
|
|
|
* The folio might have been evicted from cache only
|
2014-04-03 14:47:51 -07:00
|
|
|
* recently, in which case it should be activated like
|
2020-12-08 08:56:28 -05:00
|
|
|
* any other repeatedly accessed folio.
|
|
|
|
* The exception is folios getting rewritten; evicting other
|
2016-05-20 16:56:25 -07:00
|
|
|
* data from the working set, only to cache data that will
|
|
|
|
* get overwritten with something else, is a waste of memory.
|
2014-04-03 14:47:51 -07:00
|
|
|
*/
|
2020-12-08 08:56:28 -05:00
|
|
|
WARN_ON_ONCE(folio_test_active(folio));
|
|
|
|
if (!(gfp & __GFP_WRITE) && shadow)
|
|
|
|
workingset_refault(folio, shadow);
|
|
|
|
folio_add_lru(folio);
|
2014-04-03 14:47:51 -07:00
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
return ret;
|
|
|
|
}
|
2020-12-08 08:56:28 -05:00
|
|
|
EXPORT_SYMBOL_GPL(filemap_add_folio);
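The control flow of filemap_add_folio() above (charge, lock, try the insert, roll back on failure, otherwise decide on refault handling and add to the LRU) can be modelled in userspace. This is only a sketch under stated assumptions: `toy_folio`, `toy_add_folio` and `TOY_GFP_WRITE` are hypothetical names, the insert result is passed in rather than computed, and the charge step is assumed to always succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_GFP_WRITE 0x1u	/* hypothetical stand-in for __GFP_WRITE */

/* Hypothetical model of the folio state touched by the add path. */
struct toy_folio {
	bool charged;
	bool locked;
	bool refaulted;
	bool on_lru;
};

/*
 * insert_err models the return value of the page cache insert:
 * 0 on success, a negative errno-style value on failure.
 */
static int toy_add_folio(struct toy_folio *f, int insert_err,
			 unsigned int gfp, const void *shadow)
{
	/* Charge and lock before attempting the insert. */
	f->charged = true;
	f->locked = true;

	if (insert_err) {
		/* Insert failed: undo the charge and the lock bit. */
		f->charged = false;
		f->locked = false;
		return insert_err;
	}

	/* A shadow entry means a recent eviction; skip it for rewrites. */
	if (!(gfp & TOY_GFP_WRITE) && shadow)
		f->refaulted = true;
	f->on_lru = true;
	return 0;
}
```

In this model a failed insert leaves the folio exactly as it started, which mirrors why the real function uncharges and clears the lock bit on the error path.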
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2006-03-24 03:16:04 -08:00
|
|
|
#ifdef CONFIG_NUMA
|
2024-03-21 09:36:40 -07:00
|
|
|
struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
|
2006-03-24 03:16:04 -08:00
|
|
|
{
|
2010-05-24 14:32:08 -07:00
|
|
|
int n;
|
2020-12-15 23:11:07 -05:00
|
|
|
struct folio *folio;
|
2010-05-24 14:32:08 -07:00
|
|
|
|
2006-03-24 03:16:04 -08:00
|
|
|
if (cpuset_do_page_mem_spread()) {
|
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 16:34:11 -07:00
|
|
|
unsigned int cpuset_mems_cookie;
|
|
|
|
do {
|
2014-04-03 14:47:24 -07:00
|
|
|
cpuset_mems_cookie = read_mems_allowed_begin();
|
2012-03-21 16:34:11 -07:00
|
|
|
n = cpuset_mem_spread_node();
|
2024-05-31 13:53:50 -07:00
|
|
|
folio = __folio_alloc_node_noprof(gfp, order, n);
|
2020-12-15 23:11:07 -05:00
|
|
|
} while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));
|
2012-03-21 16:34:11 -07:00
|
|
|
|
2020-12-15 23:11:07 -05:00
|
|
|
return folio;
|
2006-03-24 03:16:04 -08:00
|
|
|
}
|
2024-03-21 09:36:40 -07:00
|
|
|
return folio_alloc_noprof(gfp, order);
|
2006-03-24 03:16:04 -08:00
|
|
|
}
|
2024-03-21 09:36:40 -07:00
|
|
|
EXPORT_SYMBOL(filemap_alloc_folio_noprof);
|
2006-03-24 03:16:04 -08:00
|
|
|
#endif
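The do/while loop in filemap_alloc_folio_noprof() retries an allocation only when it failed and the allowed-node mask changed underneath it. A minimal userspace sketch of that retry pattern follows, assuming a plain C11 atomic counter in place of the kernel's sequence counter; `mems_seq`, `read_begin`, `read_retry`, `writer_update` and `try_alloc` are all hypothetical names, and the real primitive has more machinery than a single counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the cpuset mems_allowed sequence counter. */
static atomic_uint mems_seq;

/* Start a read section: remember the current sequence value. */
static unsigned int read_begin(void)
{
	return atomic_load_explicit(&mems_seq, memory_order_acquire);
}

/* Nonzero if a writer bumped the sequence since read_begin(). */
static int read_retry(unsigned int cookie)
{
	return atomic_load_explicit(&mems_seq, memory_order_acquire) != cookie;
}

/* Writer side: bump the sequence to signal a node mask change. */
static void writer_update(void)
{
	atomic_fetch_add_explicit(&mems_seq, 1, memory_order_release);
}

static int attempts;

/* Hypothetical allocator: the first call races with a writer and fails. */
static void *try_alloc(void)
{
	static int slot;

	if (attempts++ == 0) {
		writer_update();
		return NULL;
	}
	return &slot;
}

/* Retry only when the failure may be a false one caused by the update. */
static void *alloc_with_retry(void)
{
	unsigned int cookie;
	void *p;

	do {
		cookie = read_begin();
		p = try_alloc();
	} while (!p && read_retry(cookie));

	return p;
}
```

The point of the pattern is that the fast path pays only for two loads; a genuine allocation failure with an unchanged sequence value is returned to the caller without looping.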
|
|
|
|
|
2021-05-24 13:02:30 +02:00
|
|
|
/*
|
|
|
|
* filemap_invalidate_lock_two - lock invalidate_lock for two mappings
|
|
|
|
*
|
|
|
|
* Lock exclusively invalidate_lock of any passed mapping that is not NULL.
|
|
|
|
*
|
|
|
|
* @mapping1: the first mapping to lock
|
|
|
|
* @mapping2: the second mapping to lock
|
|
|
|
*/
|
|
|
|
void filemap_invalidate_lock_two(struct address_space *mapping1,
|
|
|
|
struct address_space *mapping2)
|
|
|
|
{
|
|
|
|
if (mapping1 > mapping2)
|
|
|
|
swap(mapping1, mapping2);
|
|
|
|
if (mapping1)
|
|
|
|
down_write(&mapping1->invalidate_lock);
|
|
|
|
if (mapping2 && mapping1 != mapping2)
|
|
|
|
down_write_nested(&mapping2->invalidate_lock, 1);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_invalidate_lock_two);
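filemap_invalidate_lock_two() sorts the two mappings by address before locking, so two callers passing the same pair in opposite order still acquire the locks in the same order. A small userspace sketch of that idea, using a hypothetical `toy_lock` type that only records acquisition order instead of a real rwsem:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock that just remembers whether it is held. */
struct toy_lock {
	int held;
};

static struct toy_lock locks[2];	/* locks[0] has the lower address */
static struct toy_lock *lock_log[8];	/* order in which locks were taken */
static int lock_log_len;

static void toy_lock_acquire(struct toy_lock *l)
{
	l->held = 1;
	lock_log[lock_log_len++] = l;
}

/* Lock both (non-NULL, distinct) locks in ascending address order. */
static void toy_lock_two(struct toy_lock *a, struct toy_lock *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_lock *tmp = a;

		a = b;
		b = tmp;
	}
	if (a)
		toy_lock_acquire(a);
	if (b && a != b)
		toy_lock_acquire(b);
}
```

Because NULL compares lowest, a NULL argument always ends up in the first slot after the swap and is simply skipped, which matches the NULL checks in the kernel function.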
|
|
|
|
|
|
|
|
/*
|
|
|
|
* filemap_invalidate_unlock_two - unlock invalidate_lock for two mappings
|
|
|
|
*
|
|
|
|
* Unlock exclusive invalidate_lock of any passed mapping that is not NULL.
|
|
|
|
*
|
|
|
|
* @mapping1: the first mapping to unlock
|
|
|
|
* @mapping2: the second mapping to unlock
|
|
|
|
*/
|
|
|
|
void filemap_invalidate_unlock_two(struct address_space *mapping1,
|
|
|
|
struct address_space *mapping2)
|
|
|
|
{
|
|
|
|
if (mapping1)
|
|
|
|
up_write(&mapping1->invalidate_lock);
|
|
|
|
if (mapping2 && mapping1 != mapping2)
|
|
|
|
up_write(&mapping2->invalidate_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_invalidate_unlock_two);
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
* In order to wait for pages to become available there must be
|
|
|
|
* waitqueues associated with pages. By using a hash table of
|
|
|
|
* waitqueues where the bucket discipline is to maintain all
|
|
|
|
* waiters on the same queue and wake all when any of the pages
|
|
|
|
* becomes available, and for the woken contexts to check to be
|
|
|
|
* sure the appropriate page became available, this saves space
|
|
|
|
* at a cost of "thundering herd" phenomena during rare hash
|
|
|
|
* collisions.
|
|
|
|
*/
|
2016-12-25 13:00:30 +10:00
|
|
|
#define PAGE_WAIT_TABLE_BITS 8
|
|
|
|
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
|
2021-01-16 11:22:14 -05:00
|
|
|
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
|
2016-12-25 13:00:30 +10:00
|
|
|
|
2021-01-16 11:22:14 -05:00
|
|
|
static wait_queue_head_t *folio_waitqueue(struct folio *folio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2021-01-16 11:22:14 -05:00
|
|
|
return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
|
2005-04-16 15:20:36 -07:00
|
|
|
}
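The lookup in folio_waitqueue() maps an object pointer to one of 256 shared wait queues. The sketch below shows the same shape in userspace. It assumes a generic Fibonacci-hash multiplier rather than whatever constant the kernel's hash_ptr() uses, and `toy_waitqueue`, `toy_hash_ptr` and `toy_waitqueue_for` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define WAIT_TABLE_BITS 8
#define WAIT_TABLE_SIZE (1u << WAIT_TABLE_BITS)

/* Hypothetical wait queue head; a real one holds a lock and a list. */
struct toy_waitqueue {
	int nr_waiters;
};

static struct toy_waitqueue wait_table[WAIT_TABLE_SIZE];

/*
 * Multiplicative hash: multiply by a 64-bit odd constant and keep the
 * top 'bits' bits. The constant is an assumption, not the kernel's.
 */
static unsigned int toy_hash_ptr(const void *ptr, unsigned int bits)
{
	uint64_t val = (uint64_t)(uintptr_t)ptr;

	return (unsigned int)((val * UINT64_C(0x9E3779B97F4A7C15)) >> (64 - bits));
}

/* Every object hashes to exactly one bucket; unrelated objects may share it. */
static struct toy_waitqueue *toy_waitqueue_for(const void *obj)
{
	return &wait_table[toy_hash_ptr(obj, WAIT_TABLE_BITS)];
}
```

Since unrelated objects can land in the same bucket, a woken waiter has to re-check that the object it cares about is the one that changed, which is the collision cost the comment above describes.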
|
|
|
|
|
2016-12-25 13:00:30 +10:00
|
|
|
void __init pagecache_init(void)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2016-12-25 13:00:30 +10:00
|
|
|
int i;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2016-12-25 13:00:30 +10:00
|
|
|
for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
|
2021-01-16 11:22:14 -05:00
|
|
|
init_waitqueue_head(&folio_wait_table[i]);
|
2016-12-25 13:00:30 +10:00
|
|
|
|
|
|
|
page_writeback_init();
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain loads. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-13 14:05:35 -07:00
|
|
|
/*
|
|
|
|
* The page wait code treats the "wait->flags" somewhat unusually, because
|
2020-09-20 10:38:47 -07:00
|
|
|
* we have multiple different kinds of waits, not just the usual "exclusive"
|
2020-09-13 14:05:35 -07:00
|
|
|
* one.
|
|
|
|
*
|
|
|
|
* We have:
|
|
|
|
*
|
|
|
|
* (a) no special bits set:
|
|
|
|
*
|
|
|
|
* We're just waiting for the bit to be released, and when a waker
|
|
|
|
* calls the wakeup function, we set WQ_FLAG_WOKEN and wake it up,
|
|
|
|
* and remove it from the wait queue.
|
|
|
|
*
|
|
|
|
* Simple and straightforward.
|
|
|
|
*
|
|
|
|
* (b) WQ_FLAG_EXCLUSIVE:
|
|
|
|
*
|
|
|
|
* The waiter is waiting to get the lock, and only one waiter should
|
|
|
|
* be woken up to avoid any thundering herd behavior. We'll set the
|
|
|
|
* WQ_FLAG_WOKEN bit, wake it up, and remove it from the wait queue.
|
|
|
|
*
|
|
|
|
* This is the traditional exclusive wait.
|
|
|
|
*
|
2020-09-20 10:38:47 -07:00
|
|
|
* (c) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:
|
2020-09-13 14:05:35 -07:00
|
|
|
*
|
|
|
|
* The waiter is waiting to get the bit, and additionally wants the
|
|
|
|
* lock to be transferred to it for fair lock behavior. If the lock
|
|
|
|
* cannot be taken, we stop walking the wait queue without waking
|
|
|
|
* the waiter.
|
|
|
|
*
|
|
|
|
* This is the "fair lock handoff" case, and in addition to setting
|
|
|
|
* WQ_FLAG_WOKEN, we set WQ_FLAG_DONE to let the waiter easily see
|
|
|
|
* that it now has the lock.
|
|
|
|
*/
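The three cases (a), (b) and (c) described in the comment above can be modelled as a single decision function. This is a userspace sketch under assumptions, not the kernel's wake function: the flag values and the names `toy_waiter` and `toy_wake` are hypothetical, the lock bit is a plain `bool`, and stopping the walk in case (b) when the bit is already set is an assumption about the surrounding code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the kernel defines its own. */
#define TOY_FLAG_EXCLUSIVE 0x01u
#define TOY_FLAG_WOKEN     0x02u
#define TOY_FLAG_CUSTOM    0x04u
#define TOY_FLAG_DONE      0x08u

struct toy_waiter {
	unsigned int flags;
};

/*
 * Returns -1 to stop the queue walk without waking this waiter,
 *          1 after waking an exclusive waiter (the walk should stop),
 *          0 after waking a non-exclusive waiter (the walk continues).
 */
static int toy_wake(struct toy_waiter *w, bool *bit_set)
{
	unsigned int flags = w->flags;

	if (flags & TOY_FLAG_EXCLUSIVE) {
		/* Someone already holds the bit: waking would be pointless. */
		if (*bit_set)
			return -1;
		if (flags & TOY_FLAG_CUSTOM) {
			/* Fair handoff: take the bit on the waiter's behalf. */
			*bit_set = true;
			flags |= TOY_FLAG_DONE;
		}
	}

	w->flags = flags | TOY_FLAG_WOKEN;
	return (flags & TOY_FLAG_EXCLUSIVE) != 0;
}
```

In the handoff case the waiter can tell from the DONE flag alone that it already owns the lock, so it does not need to race for the bit after waking.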
|
2017-06-20 12:06:13 +02:00
|
|
|
static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
|
2011-05-24 17:11:29 -07:00
|
|
|
{
|
2020-09-13 14:05:35 -07:00
|
|
|
unsigned int flags;
|
2016-12-25 13:00:30 +10:00
|
|
|
struct wait_page_key *key = arg;
|
|
|
|
struct wait_page_queue *wait_page
|
|
|
|
= container_of(wait, struct wait_page_queue, wait);
|
|
|
|
|
2020-08-03 13:01:22 -07:00
|
|
|
if (!wake_page_match(wait_page, key))
|
2016-12-25 13:00:30 +10:00
|
|
|
return 0;
|
Minor page waitqueue cleanups
Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.
Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.
In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.
That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:
(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.
(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.
Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-27 13:55:12 -07:00
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 00:36:14 -08:00
|
|
|
/*
|
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
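The wake-side checks that appear in the code lines of this listing (the `WQ_FLAG_EXCLUSIVE` test, the `WQ_FLAG_CUSTOM` handoff, and `WQ_FLAG_DONE`) can be modeled in plain userspace C. This is only a single-threaded sketch under stated assumptions: every `model_*` name and `MODEL_*` flag is hypothetical, a `bool` stands in for the page flag bit, there are no atomics or memory barriers, and the walk loop stops only on a negative return, which is a simplification of the real wait-queue walker.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's wait-queue flag bits. */
enum {
	MODEL_EXCLUSIVE = 1u << 0,	/* waiter wants the bit for itself */
	MODEL_CUSTOM    = 1u << 1,	/* waiter asked for a lock handoff */
	MODEL_WOKEN     = 1u << 2,	/* waker has processed this waiter */
	MODEL_DONE      = 1u << 3,	/* bit was taken on the waiter's behalf */
};

struct model_waiter {
	unsigned int flags;
};

/* Non-atomic stand-in for test_and_set_bit(): returns the old value. */
static bool model_test_and_set(bool *bit)
{
	bool old = *bit;

	*bit = true;
	return old;
}

/*
 * Model of the per-waiter wake decision.
 * Returns -1 to stop the walk without waking this waiter, 1 if an
 * exclusive waiter was woken, 0 if a non-exclusive waiter was woken.
 */
static int model_wake(struct model_waiter *w, bool *bit)
{
	unsigned int flags = w->flags;

	if (flags & MODEL_EXCLUSIVE) {
		if (*bit)
			return -1;	/* bit already set again: stop walking */
		if (flags & MODEL_CUSTOM) {
			if (model_test_and_set(bit))
				return -1;	/* handoff failed: stop walking */
			flags |= MODEL_DONE;	/* waiter now owns the bit */
		}
	}
	w->flags = flags | MODEL_WOKEN;
	return (flags & MODEL_EXCLUSIVE) != 0;
}

/* Walk the queue in order; returns how many waiters were woken. */
static size_t model_walk(struct model_waiter *q, size_t n, bool *bit)
{
	size_t woken = 0;

	for (size_t i = 0; i < n; i++) {
		if (model_wake(&q[i], bit) < 0)
			break;
		woken++;
	}
	return woken;
}
```

With a queue of one non-exclusive waiter followed by two handoff waiters and the bit initially clear, the model wakes the first two, hands the bit to the second, and leaves the third untouched because the bit is set again by the time the walk reaches it.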
2020-09-13 14:05:35 -07:00
|
|
|
* If it's a lock handoff wait, we get the bit for it, and
|
|
|
|
* stop walking (and do not wake it up) if we can't.
|
2018-12-28 00:36:14 -08:00
|
|
|
*/
|
2020-09-13 14:05:35 -07:00
|
|
|
flags = wait->flags;
|
|
|
|
if (flags & WQ_FLAG_EXCLUSIVE) {
|
2021-01-16 11:22:14 -05:00
|
|
|
if (test_bit(key->bit_nr, &key->folio->flags))
|
2020-07-23 10:16:49 -07:00
|
|
|
return -1;
|
2020-09-13 14:05:35 -07:00
|
|
|
if (flags & WQ_FLAG_CUSTOM) {
|
2021-01-16 11:22:14 -05:00
|
|
|
if (test_and_set_bit(key->bit_nr, &key->folio->flags))
|
2020-09-13 14:05:35 -07:00
|
|
|
return -1;
|
|
|
|
flags |= WQ_FLAG_DONE;
|
|
|
|
}
|
2020-07-23 10:16:49 -07:00
|
|
|
}
|
2011-05-24 17:11:29 -07:00
|
|
|
|
2020-09-13 14:05:35 -07:00
|
|
|
/*
|
|
|
|
* We are holding the wait-queue lock, but the waiter that
|
|
|
|
* is waiting for this will be checking the flags without
|
|
|
|
* any locking.
|
|
|
|
*
|
|
|
|
* So update the flags atomically, and wake up the waiter
|
|
|
|
* afterwards to avoid any races. This store-release pairs
|
2021-03-04 12:02:54 -05:00
|
|
|
* with the load-acquire in folio_wait_bit_common().
|
2020-09-13 14:05:35 -07:00
|
|
|
*/
|
|
|
|
smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
|
2020-07-23 10:16:49 -07:00
|
|
|
wake_up_state(wait->private, mode);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ok, we have successfully done what we're waiting for,
|
|
|
|
* and we can unconditionally remove the wait entry.
|
|
|
|
*
|
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain loads. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-13 14:05:35 -07:00
|
|
|
* Note that this pairs with the "finish_wait()" in the
|
|
|
|
* waiter, and has to be the absolute last thing we do.
|
|
|
|
* After this list_del_init(&wait->entry) the wait entry
|
2020-07-23 10:16:49 -07:00
|
|
|
* might be de-allocated and the process might even have
|
|
|
|
* exited.
|
|
|
|
*/
|
2020-07-23 12:33:41 -07:00
|
|
|
list_del_init_careful(&wait->entry);
|
2020-09-13 14:05:35 -07:00
|
|
|
return (flags & WQ_FLAG_EXCLUSIVE) != 0;
|
2011-05-24 17:11:29 -07:00
|
|
|
}
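The ordering used at the end of the wake function above (publish the WOKEN flag, wake the task, and only then unlink the entry) can be modelled in user space with C11 atomics. This is a minimal sketch under stated assumptions, not the kernel implementation: `struct wait_entry`, `FLAG_EXCLUSIVE`, `FLAG_WOKEN`, `waker_finish()` and `waiter_may_free()` are hypothetical names, and the `queued` field stands in for the list linkage that list_del_init_careful() clears.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_EXCLUSIVE 0x01u /* hypothetical stand-in for WQ_FLAG_EXCLUSIVE */
#define FLAG_WOKEN     0x02u /* hypothetical stand-in for WQ_FLAG_WOKEN */

struct wait_entry {
	atomic_uint flags;  /* published to the waiter with release ordering */
	atomic_bool queued; /* models "entry is still on the wait list" */
};

/*
 * Waker side: publish WOKEN first, unlink the entry last.
 * Returns true when the woken waiter was an exclusive one.
 */
static bool waker_finish(struct wait_entry *w)
{
	unsigned int flags = atomic_load_explicit(&w->flags, memory_order_relaxed);

	assert(atomic_load_explicit(&w->queued, memory_order_relaxed));
	atomic_store_explicit(&w->flags, flags | FLAG_WOKEN, memory_order_release);
	/* A real waker would wake the sleeping task at this point. */

	/* Last access to *w: after this store the waiter may free the entry. */
	atomic_store_explicit(&w->queued, false, memory_order_release);
	return (flags & FLAG_EXCLUSIVE) != 0;
}

/* Waiter side: the acquire load pairs with the waker's final release store. */
static bool waiter_may_free(struct wait_entry *w)
{
	return !atomic_load_explicit(&w->queued, memory_order_acquire);
}
```

The point of the sketch is the order of the two stores: once `queued` reads false, the waiter is free to reuse or release the entry, so the waker must not touch it afterwards.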
|
|
|
|
|
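The "controlled amount of unfairness" described in the commit message can be pictured as a per-waiter budget: a waiter tolerates losing the lock a few times, then insists on ordered handoff. The following is only an illustrative sketch; `round_must_be_fair()` and `DEFAULT_UNFAIRNESS` are hypothetical names, and the value 5 is the default quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_UNFAIRNESS 5 /* default quoted in the commit message */

/*
 * Hypothetical helper, called once per wait round.
 * While budget remains, the round is "unfair": other lockers may steal
 * the lock. Once the budget is used up, every later round is "fair" and
 * the waiter asks for the lock to be handed over in order.
 */
static bool round_must_be_fair(int *budget)
{
	assert(budget != NULL);
	if (*budget > 0) {
		(*budget)--;
		return false;
	}
	return true;
}
```

With the default budget, the first five rounds allow stealing and the sixth round onwards is fair; a budget of 0 makes every round fair.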
2021-01-15 17:14:48 -05:00
|
|
|
static void folio_wake_bit(struct folio *folio, int bit_nr)
|
2014-09-25 13:55:19 +10:00
|
|
|
{
|
2021-01-16 11:22:14 -05:00
|
|
|
wait_queue_head_t *q = folio_waitqueue(folio);
|
2016-12-25 13:00:30 +10:00
|
|
|
struct wait_page_key key;
|
|
|
|
unsigned long flags;
|
2014-09-25 13:55:19 +10:00
|
|
|
|
2021-01-16 11:22:14 -05:00
|
|
|
key.folio = folio;
|
2016-12-25 13:00:30 +10:00
|
|
|
key.bit_nr = bit_nr;
|
|
|
|
key.page_match = 0;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&q->lock, flags);
|
2023-10-10 04:58:28 +01:00
|
|
|
__wake_up_locked_key(q, TASK_NORMAL, &key);
|
2017-08-25 09:13:55 -07:00
|
|
|
|
2016-12-25 13:00:30 +10:00
|
|
|
/*
|
2022-03-24 18:09:49 -07:00
|
|
|
* It's possible to miss clearing waiters here, when we woke our page
|
|
|
|
* waiters, but the hashed waitqueue has waiters for other pages on it.
|
|
|
|
* That's okay, it's a rare case. The next waker will clear it.
|
2016-12-25 13:00:30 +10:00
|
|
|
*
|
2022-03-24 18:09:49 -07:00
|
|
|
* Note that, depending on the page pool (buddy, hugetlb, ZONE_DEVICE,
|
|
|
|
* other), the flag may be cleared in the course of freeing the page;
|
|
|
|
* but that is not required for correctness.
|
2016-12-25 13:00:30 +10:00
|
|
|
*/
|
2022-03-24 18:09:49 -07:00
|
|
|
if (!waitqueue_active(q) || !key.page_match)
|
2021-01-15 17:14:48 -05:00
|
|
|
folio_clear_waiters(folio);
|
2022-03-24 18:09:49 -07:00
|
|
|
|
2016-12-25 13:00:30 +10:00
|
|
|
spin_unlock_irqrestore(&q->lock, flags);
|
|
|
|
}
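The "miss clearing waiters" case in the comment above comes from several pages sharing one hashed wait queue. A simplified user-space model, assuming a flat array in place of the kernel wait queue, shows the decision. `struct waiter` and `wake_bit()` are hypothetical names, and the early stop of the walk for exclusive waiters is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct waiter {
	int page;    /* which page this waiter sleeps on */
	int bit_nr;  /* which flag bit it waits for */
	bool queued; /* still on the shared (hashed) queue? */
};

/*
 * Wake every waiter for (page, bit_nr) on a queue shared by several pages.
 * Returns true when the caller may clear the page's "has waiters" hint:
 * either the queue is now empty, or no entry for this page was seen.
 */
static bool wake_bit(struct waiter *q, size_t n, int page, int bit_nr)
{
	bool page_match = false; /* saw any entry for this page */
	bool active = false;     /* queue still has entries afterwards */
	size_t i;

	assert(q != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!q[i].queued)
			continue;
		if (q[i].page == page) {
			page_match = true;
			if (q[i].bit_nr == bit_nr)
				q[i].queued = false; /* woken and unlinked */
		}
		if (q[i].queued)
			active = true;
	}
	return !active || !page_match;
}
```

If page 1 and page 2 each have one waiter on the same queue, the first wake for page 1 returns false (the hint stays set although page 1 has no waiters left), and the next wake for page 1 returns true and clears it.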
|
2017-02-22 15:44:41 -08:00
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 00:36:14 -08:00
|
|
|
/*
|
2021-03-04 12:02:54 -05:00
|
|
|
* A choice of three behaviors for folio_wait_bit_common():
|
2018-12-28 00:36:14 -08:00
|
|
|
*/
|
|
|
|
enum behavior {
|
|
|
|
EXCLUSIVE, /* Hold ref to page and take the bit when woken, like
|
2021-03-01 19:38:25 -05:00
|
|
|
* __folio_lock() waiting on then setting PG_locked.
|
2018-12-28 00:36:14 -08:00
|
|
|
*/
|
|
|
|
SHARED, /* Hold ref to page and check the bit when woken, like
|
2021-08-16 23:36:31 -04:00
|
|
|
* folio_wait_writeback() waiting on PG_writeback.
|
2018-12-28 00:36:14 -08:00
|
|
|
*/
|
|
|
|
DROP, /* Drop ref to page before wait, no check when woken,
|
2021-08-16 23:36:31 -04:00
|
|
|
* like folio_put_wait_locked() on PG_locked.
|
2018-12-28 00:36:14 -08:00
|
|
|
*/
|
|
|
|
};
|
|
|
|
|
2020-07-23 10:16:49 -07:00
|
|
|
/*
|
2021-03-04 12:02:54 -05:00
|
|
|
* Attempt to check (or get) the folio flag, and mark us done
|
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-13 14:05:35 -07:00
|
|
|
* if successful.
|
2020-07-23 10:16:49 -07:00
|
|
|
*/
|
2021-03-04 12:02:54 -05:00
|
|
|
static inline bool folio_trylock_flag(struct folio *folio, int bit_nr,
|
2020-07-23 10:16:49 -07:00
|
|
|
struct wait_queue_entry *wait)
|
|
|
|
{
|
|
|
|
if (wait->flags & WQ_FLAG_EXCLUSIVE) {
|
2021-03-04 12:02:54 -05:00
|
|
|
if (test_and_set_bit(bit_nr, &folio->flags))
|
2020-07-23 10:16:49 -07:00
|
|
|
return false;
|
2021-03-04 12:02:54 -05:00
|
|
|
} else if (test_bit(bit_nr, &folio->flags))
|
2020-07-23 10:16:49 -07:00
|
|
|
return false;
|
|
|
|
|
2020-09-13 14:05:35 -07:00
|
|
|
wait->flags |= WQ_FLAG_WOKEN | WQ_FLAG_DONE;
|
2020-07-23 10:16:49 -07:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2020-09-13 14:05:35 -07:00
|
|
|
/* How many times do we accept lock stealing from under a waiter? */
|
|
|
|
int sysctl_page_lock_unfairness = 5;
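A minimal userspace sketch of how this budget is consumed, assuming the per-attempt logic visible at the `repeat:` label further down (`--unfairness < 0` adds a second flag to an exclusive waiter). The `SKETCH_*` constants and both helper functions are hypothetical stand-ins, not kernel definitions, and the reading that the extra flag marks the switch to strict ordering is taken from the commit message rather than from the code shown here.

```c
/* Hypothetical stand-ins for the kernel's wait-queue flag bits. */
#define SKETCH_FLAG_EXCLUSIVE 0x01u
#define SKETCH_FLAG_CUSTOM    0x08u

/*
 * Flags an exclusive waiter would queue with on one attempt.
 * Each call spends one unit of the budget; once the budget has
 * gone negative, the CUSTOM bit is added as well.
 */
static unsigned int sketch_wait_flags(int *unfairness)
{
	unsigned int flags = SKETCH_FLAG_EXCLUSIVE;

	if (--(*unfairness) < 0)
		flags |= SKETCH_FLAG_CUSTOM;
	return flags;
}

/* Count the attempts made before the CUSTOM bit first appears. */
static int sketch_tolerated_attempts(int budget)
{
	int attempts = 0;

	while (!(sketch_wait_flags(&budget) & SKETCH_FLAG_CUSTOM))
		attempts++;
	return attempts;
}
```

With the default budget of 5, `sketch_tolerated_attempts(5)` counts five attempts that carry only the exclusive flag; the sixth carries the extra flag. A budget of 0 sets the extra flag on the very first attempt.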
|
|
|
|
|
2021-03-04 12:02:54 -05:00
|
|
|
static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
|
|
|
|
int state, enum behavior behavior)
|
2016-12-25 13:00:30 +10:00
|
|
|
{
|
2021-01-16 11:22:14 -05:00
|
|
|
wait_queue_head_t *q = folio_waitqueue(folio);
|
2020-09-13 14:05:35 -07:00
|
|
|
int unfairness = sysctl_page_lock_unfairness;
|
2016-12-25 13:00:30 +10:00
|
|
|
struct wait_page_queue wait_page;
|
2017-06-20 12:06:13 +02:00
|
|
|
wait_queue_entry_t *wait = &wait_page.wait;
|
2018-10-26 15:06:08 -07:00
|
|
|
bool thrashing = false;
|
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 15:06:27 -07:00
|
|
|
unsigned long pflags;
|
2022-08-15 07:11:35 +00:00
|
|
|
bool in_thrashing;
|
2016-12-25 13:00:30 +10:00
|
|
|
|
2018-10-26 15:06:27 -07:00
|
|
|
if (bit_nr == PG_locked &&
|
2021-03-04 12:02:54 -05:00
|
|
|
!folio_test_uptodate(folio) && folio_test_workingset(folio)) {
|
2022-08-15 07:11:35 +00:00
|
|
|
delayacct_thrashing_start(&in_thrashing);
|
2018-10-26 15:06:27 -07:00
|
|
|
psi_memstall_enter(&pflags);
|
2018-10-26 15:06:08 -07:00
|
|
|
thrashing = true;
|
|
|
|
}
|
|
|
|
|
2016-12-25 13:00:30 +10:00
|
|
|
init_wait(wait);
|
|
|
|
wait->func = wake_page_function;
|
2021-01-16 11:22:14 -05:00
|
|
|
wait_page.folio = folio;
|
2016-12-25 13:00:30 +10:00
|
|
|
wait_page.bit_nr = bit_nr;
|
|
|
|
|
2020-09-13 14:05:35 -07:00
|
|
|
repeat:
|
|
|
|
wait->flags = 0;
|
|
|
|
if (behavior == EXCLUSIVE) {
|
|
|
|
wait->flags = WQ_FLAG_EXCLUSIVE;
|
|
|
|
if (--unfairness < 0)
|
|
|
|
wait->flags |= WQ_FLAG_CUSTOM;
|
|
|
|
}
|
|
|
|
|
2020-07-23 10:16:49 -07:00
|
|
|
/*
|
|
|
|
* Do one last check whether we can get the
|
|
|
|
* page bit synchronously.
|
|
|
|
*
|
2021-03-04 12:02:54 -05:00
|
|
|
* Do the folio_set_waiters() marking before that
|
2020-07-23 10:16:49 -07:00
|
|
|
* to let any waker we _just_ missed know they
|
|
|
|
* need to wake us up (otherwise they'll never
|
|
|
|
* even go to the slow case that looks at the
|
|
|
|
* page queue), and add ourselves to the wait
|
|
|
|
* queue if we need to sleep.
|
|
|
|
*
|
|
|
|
* This part needs to be done under the queue
|
|
|
|
* lock to avoid races.
|
|
|
|
*/
|
|
|
|
spin_lock_irq(&q->lock);
|
2021-03-04 12:02:54 -05:00
|
|
|
folio_set_waiters(folio);
|
|
|
|
if (!folio_trylock_flag(folio, bit_nr, wait))
|
2020-07-23 10:16:49 -07:00
|
|
|
__add_wait_queue_entry_tail(q, wait);
|
|
|
|
spin_unlock_irq(&q->lock);
|
2016-12-25 13:00:30 +10:00
|
|
|
|
2020-07-23 10:16:49 -07:00
|
|
|
/*
|
|
|
|
* From now on, all the logic will be based on
|
2020-09-13 14:05:35 -07:00
|
|
|
* the WQ_FLAG_WOKEN and WQ_FLAG_DONE flag, to
|
|
|
|
* see whether the page bit testing has already
|
|
|
|
* been done by the wake function.
|
2020-07-23 10:16:49 -07:00
|
|
|
*
|
2021-03-04 12:02:54 -05:00
|
|
|
* We can drop our reference to the folio.
|
2020-07-23 10:16:49 -07:00
|
|
|
*/
|
|
|
|
if (behavior == DROP)
|
2021-03-04 12:02:54 -05:00
|
|
|
folio_put(folio);
|
2016-12-25 13:00:30 +10:00
|
|
|
|
2020-09-13 14:05:35 -07:00
	/*
	 * Note that until the "finish_wait()", or until
	 * we see the WQ_FLAG_WOKEN flag, we need to
	 * be very careful with the 'wait->flags', because
	 * we may race with a waker that sets them.
	 */
2020-07-23 10:16:49 -07:00
	for (;;) {
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
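The bounded unfairness described in this paragraph can be modelled in a few lines of C. This is only a sketch under stated assumptions, not the kernel implementation: `struct waiter_model`, `waiter_init()` and `waiter_requeue()` are hypothetical names, and the budget stands in for the sysctl value mentioned in this message.

```c
#include <stdbool.h>

/* Hypothetical model of one waiter; not the kernel's wait queue entry. */
struct waiter_model {
	int unfairness;      /* queueing rounds in which stealing is still tolerated */
	bool wants_handoff;  /* set once the budget has been used up */
};

/* Start a waiter with a stealing budget (the commit's default is 5). */
static void waiter_init(struct waiter_model *w, int budget)
{
	w->unfairness = budget;
	w->wants_handoff = false;
}

/*
 * Called every time the waiter is about to queue (or re-queue) for the lock.
 * Returns true once the waiter should insist on a strictly ordered hand-off
 * instead of racing with other lockers again.
 */
static bool waiter_requeue(struct waiter_model *w)
{
	if (!w->wants_handoff && --w->unfairness < 0)
		w->wants_handoff = true;
	return w->wants_handoff;
}
```

With a budget of 2, the first two calls to `waiter_requeue()` return false and every later call returns true, which is the "few rounds of stealing, then strict ordering" behavior in miniature.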
There is also a hint from Matthieu Baerts that the fair page locking
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
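For debugging, the knob can be inspected from user space like any other integer sysctl file. The helper below is a minimal sketch: `read_int_file()` is a hypothetical name, and the only thing taken from this message is the path /proc/sys/vm/page_lock_unfairness.

```c
#include <stdio.h>

/*
 * Read a single integer from a procfs-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_int_file(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass "/proc/sys/vm/page_lock_unfairness" as `path`; on a kernel without the knob the helper simply returns -1.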
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain loads. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-13 14:05:35 -07:00
		unsigned int flags;

2016-12-25 13:00:30 +10:00
		set_current_state(state);

mm: allow a controlled amount of unfairness in the page lock
2020-09-13 14:05:35 -07:00
		/* Loop until we've been woken or interrupted */
		flags = smp_load_acquire(&wait->flags);
		if (!(flags & WQ_FLAG_WOKEN)) {
			if (signal_pending_state(state, current))
				break;

			io_schedule();
			continue;
		}

		/* If we were non-exclusive, we're done */
		if (behavior != EXCLUSIVE)
2017-08-27 16:25:09 -07:00
			break;
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 00:36:14 -08:00
mm: allow a controlled amount of unfairness in the page lock
2020-09-13 14:05:35 -07:00
		/* If the waker got the lock for us, we're done */
		if (flags & WQ_FLAG_DONE)
mm: put_and_wait_on_page_locked() while page is migrated
2018-12-28 00:36:14 -08:00
			break;
2020-07-23 10:16:49 -07:00
mm: allow a controlled amount of unfairness in the page lock
2020-09-13 14:05:35 -07:00
		/*
		 * Otherwise, if we're getting the lock, we need to
		 * try to get it ourselves.
		 *
		 * And if that fails, we'll have to retry this all.
		 */
2021-03-04 12:02:54 -05:00
		if (unlikely(test_and_set_bit(bit_nr, folio_flags(folio, 0))))
mm: allow a controlled amount of unfairness in the page lock
2020-09-13 14:05:35 -07:00
			goto repeat;

		wait->flags |= WQ_FLAG_DONE;
		break;
2016-12-25 13:00:30 +10:00
	}

mm: allow a controlled amount of unfairness in the page lock
2020-09-13 14:05:35 -07:00
	/*
	 * If a signal happened, this 'finish_wait()' may remove the last
2021-03-04 12:02:54 -05:00
	 * waiter from the wait-queues, but the folio waiters bit will remain
mm: allow a controlled amount of unfairness in the page lock
2020-09-13 14:05:35 -07:00
	 * set. That's ok. The next wakeup will take care of it, and trying
	 * to do it here would be difficult and prone to races.
	 */
2016-12-25 13:00:30 +10:00
	finish_wait(q, wait);

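The decision the waiter takes on each pass through the loop above can be summarised as a small pure function. This is a user-space sketch for illustration only: the `M_*` constants, `enum m_action` and `waiter_step()` are hypothetical names that mirror the roles of WQ_FLAG_WOKEN, WQ_FLAG_DONE and the EXCLUSIVE behavior, and the sketch leaves out the memory ordering and the actual sleeping.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the wait queue flags used in the loop. */
#define M_WOKEN 0x1u
#define M_DONE  0x2u

enum m_behavior { M_EXCLUSIVE, M_SHARED };

enum m_action {
	M_SLEEP,        /* not woken yet: schedule and look again */
	M_INTERRUPTED,  /* not woken, but a signal is pending: give up */
	M_FINISHED,     /* nothing more to do, leave the loop */
	M_TRY_LOCK      /* woken without the lock: try to take it ourselves */
};

/* One pass of the waiter's loop, reduced to its branching logic. */
static enum m_action waiter_step(unsigned int flags, enum m_behavior behavior,
				 bool signal_pending)
{
	if (!(flags & M_WOKEN))
		return signal_pending ? M_INTERRUPTED : M_SLEEP;
	if (behavior != M_EXCLUSIVE)
		return M_FINISHED;
	if (flags & M_DONE)
		return M_FINISHED;
	return M_TRY_LOCK;
}
```

In the loop above, a failed attempt in the `M_TRY_LOCK` case corresponds to `goto repeat`, where the waiter queues itself again.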
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risking complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
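The aggregation step described above can be sketched in userspace as follows. This is a toy model, not the kernel's implementation: the function names, the use of floating point, and the exponential-decay form of the running average (borrowed from how the load average is usually described) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of one aggregation step: average per-CPU stall time,
 * weighting each CPU by its non-idle time, then express the result as a
 * percentage of the sampling period. All names here are illustrative.
 */
double weighted_stall_pct(const double *stall_s, const double *nonidle_s,
			  size_t ncpus, double period_s)
{
	double weighted = 0.0, weight = 0.0;
	size_t i;

	assert(period_s > 0.0);
	for (i = 0; i < ncpus; i++) {
		weighted += stall_s[i] * nonidle_s[i];
		weight += nonidle_s[i];
	}
	if (weight == 0.0)
		return 0.0;	/* every CPU was idle: nothing to report */
	return 100.0 * (weighted / weight) / period_s;
}

/* One step of an exponentially decaying running average (assumed form). */
double running_avg(double prev, double sample, double decay)
{
	assert(decay >= 0.0 && decay <= 1.0);
	return prev * decay + sample * (1.0 - decay);
}
```

With one busy CPU stalled for 1s of a 2s period and one fully idle CPU, the weighted result is 50%, whereas a plain mean over both CPUs would report 25%. That difference is the artifact the weighting is meant to remove.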
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 15:06:27 -07:00
|
|
|
if (thrashing) {
|
2022-08-15 07:11:35 +00:00
|
|
|
delayacct_thrashing_end(&in_thrashing);
|
2018-10-26 15:06:27 -07:00
|
|
|
psi_memstall_leave(&pflags);
|
|
|
|
}
|
2018-10-26 15:06:08 -07:00
|
|
|
|
2016-12-25 13:00:30 +10:00
|
|
|
/*
|
mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
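The "few rounds of stealing, then strict ordering" policy can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the struct and function names are hypothetical, and the real wait loop involves wait-queue flags and atomic bit operations that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-waiter state: how many more times it tolerates stealing. */
struct demo_waiter {
	int unfairness;
};

/*
 * Called each time the waiter (re)queues itself. Returns true once the
 * budget is used up, meaning the waiter now asks for a strict handoff.
 */
bool demo_wants_handoff(struct demo_waiter *w)
{
	return --w->unfairness < 0;
}

/* Count how many opportunistic rounds happen before the handoff request. */
int demo_rounds_until_fair(int budget)
{
	struct demo_waiter w = { .unfairness = budget };
	int rounds = 0;

	assert(budget >= 0);
	while (!demo_wants_handoff(&w))
		rounds++;
	return rounds;
}
```

In this model a budget of 5 lets the waiter lose the race five times before it insists on being handed the lock in order, and a budget of 0 behaves like the fully fair lock.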
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
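For debugging, the current setting can be inspected from userspace. The sketch below assumes only the procfs path named above; the helper names are hypothetical, and a kernel without this knob will simply make the open fail.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the text form of the sysctl; returns 0 on success, -1 on error. */
int parse_unfairness(const char *text, int *value)
{
	assert(text && value);
	return sscanf(text, "%d", value) == 1 ? 0 : -1;
}

/* Read the current setting from the path named in the commit message. */
int read_page_lock_unfairness(int *value)
{
	char buf[32];
	FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "r");
	int ret = -1;

	if (!f)
		return -1;	/* kernel without the knob, or no procfs */
	if (fgets(buf, sizeof(buf), f))
		ret = parse_unfairness(buf, value);
	fclose(f);
	return ret;
}
```

Writing a new value works the same way with the file opened for writing, and normally requires root.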
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain loads. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but rather to things like just verifying that the page's
file mapping is stable while faulting the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-13 14:05:35 -07:00
|
|
|
* NOTE! The wait->flags aren't stable until we've done the
|
|
|
|
* 'finish_wait()', and we could have exited the loop above due
|
|
|
|
* to a signal, and had a wakeup event happen after the signal
|
|
|
|
* test but before the 'finish_wait()'.
|
|
|
|
*
|
|
|
|
* So only after the finish_wait() can we reliably determine
|
|
|
|
* if we got woken up or not, so we can now figure out the final
|
|
|
|
* return value based on that state without races.
|
|
|
|
*
|
|
|
|
* Also note that WQ_FLAG_WOKEN is sufficient for a non-exclusive
|
|
|
|
* waiter, but an exclusive one requires WQ_FLAG_DONE.
|
2016-12-25 13:00:30 +10:00
|
|
|
*/
|
2020-09-13 14:05:35 -07:00
|
|
|
if (behavior == EXCLUSIVE)
|
|
|
|
return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;
|
2016-12-25 13:00:30 +10:00
|
|
|
|
2020-07-23 10:16:49 -07:00
|
|
|
return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
|
2016-12-25 13:00:30 +10:00
|
|
|
}
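The final return-value decision above can be modelled in isolation. This is a userspace sketch, not kernel code: the DEMO_ names are hypothetical, the flag bit values are placeholders chosen for the example, and only the two behaviors visible in this excerpt are modelled.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder bit values; the kernel defines the real flags in <linux/wait.h>. */
#define DEMO_WQ_FLAG_WOKEN 0x02u
#define DEMO_WQ_FLAG_DONE  0x10u

enum demo_behavior { DEMO_EXCLUSIVE, DEMO_SHARED };

/*
 * Decide the result once finish_wait() has made the flags stable:
 * an exclusive waiter succeeded only if the lock was actually handed
 * over (DONE), while a non-exclusive waiter only needs to have been woken.
 */
int demo_wait_result(unsigned int flags, enum demo_behavior behavior)
{
	if (behavior == DEMO_EXCLUSIVE)
		return (flags & DEMO_WQ_FLAG_DONE) ? 0 : -EINTR;
	return (flags & DEMO_WQ_FLAG_WOKEN) ? 0 : -EINTR;
}
```

The case worth noticing is an exclusive waiter that was woken but never given the lock: in this model it reports -EINTR, because being woken does not prove the lock is held.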
|
|
|
|
|
2022-01-21 22:10:46 -08:00
|
|
|
#ifdef CONFIG_MIGRATION
|
|
|
|
/**
|
|
|
|
* migration_entry_wait_on_locked - Wait for a migration entry to be removed
|
|
|
|
* @entry: migration swap entry.
|
|
|
|
* @ptl: already locked ptl. This function will drop the lock.
|
|
|
|
*
|
|
|
|
* Wait for a migration entry referencing the given page to be removed. This is
|
|
|
|
* equivalent to folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE) except
|
|
|
|
* this can be called without taking a reference on the page. Instead this
|
|
|
|
* should be called while holding the ptl for the migration entry referencing
|
|
|
|
* the page.
|
|
|
|
*
|
2023-06-08 18:08:20 -07:00
|
|
|
* Returns after unlocking the ptl.
|
2022-01-21 22:10:46 -08:00
|
|
|
*
|
|
|
|
* This follows the same logic as folio_wait_bit_common() so see the comments
|
|
|
|
* there.
|
|
|
|
*/
|
2023-06-08 18:08:20 -07:00
|
|
|
void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
|
|
|
|
__releases(ptl)
|
2022-01-21 22:10:46 -08:00
|
|
|
{
|
|
|
|
struct wait_page_queue wait_page;
|
|
|
|
wait_queue_entry_t *wait = &wait_page.wait;
|
|
|
|
bool thrashing = false;
|
|
|
|
unsigned long pflags;
|
2022-08-15 07:11:35 +00:00
|
|
|
bool in_thrashing;
|
2022-01-21 22:10:46 -08:00
|
|
|
wait_queue_head_t *q;
|
2024-01-11 15:24:20 +00:00
|
|
|
struct folio *folio = pfn_swap_entry_folio(entry);
|
2022-01-21 22:10:46 -08:00
|
|
|
|
|
|
|
q = folio_waitqueue(folio);
|
|
|
|
if (!folio_test_uptodate(folio) && folio_test_workingset(folio)) {
|
2022-08-15 07:11:35 +00:00
|
|
|
delayacct_thrashing_start(&in_thrashing);
|
2022-01-21 22:10:46 -08:00
|
|
|
psi_memstall_enter(&pflags);
|
|
|
|
thrashing = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
init_wait(wait);
|
|
|
|
wait->func = wake_page_function;
|
|
|
|
wait_page.folio = folio;
|
|
|
|
wait_page.bit_nr = PG_locked;
|
|
|
|
wait->flags = 0;
|
|
|
|
|
|
|
|
spin_lock_irq(&q->lock);
|
|
|
|
folio_set_waiters(folio);
|
|
|
|
if (!folio_trylock_flag(folio, PG_locked, wait))
|
|
|
|
__add_wait_queue_entry_tail(q, wait);
|
|
|
|
spin_unlock_irq(&q->lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a migration entry exists for the page the migration path must hold
|
|
|
|
* a valid reference to the page, and it must take the ptl to remove the
|
|
|
|
* migration entry. So the page is valid until the ptl is dropped.
|
|
|
|
*/
|
2023-06-08 18:08:20 -07:00
|
|
|
spin_unlock(ptl);
|
2022-01-21 22:10:46 -08:00
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
unsigned int flags;
|
|
|
|
|
|
|
|
set_current_state(TASK_UNINTERRUPTIBLE);
|
|
|
|
|
|
|
|
/* Loop until we've been woken or interrupted */
|
|
|
|
flags = smp_load_acquire(&wait->flags);
|
|
|
|
if (!(flags & WQ_FLAG_WOKEN)) {
|
|
|
|
if (signal_pending_state(TASK_UNINTERRUPTIBLE, current))
|
|
|
|
break;
|
|
|
|
|
|
|
|
io_schedule();
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
finish_wait(q, wait);
|
|
|
|
|
|
|
|
if (thrashing) {
|
2022-08-15 07:11:35 +00:00
|
|
|
delayacct_thrashing_end(&in_thrashing);
|
2022-01-21 22:10:46 -08:00
|
|
|
psi_memstall_leave(&pflags);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2021-03-04 12:02:54 -05:00
|
|
|
void folio_wait_bit(struct folio *folio, int bit_nr)
|
2016-12-25 13:00:30 +10:00
|
|
|
{
|
2021-03-04 12:02:54 -05:00
|
|
|
folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
|
2016-12-25 13:00:30 +10:00
|
|
|
}
|
2021-03-04 12:02:54 -05:00
|
|
|
EXPORT_SYMBOL(folio_wait_bit);
|
2016-12-25 13:00:30 +10:00
|
|
|
|
2021-03-04 12:02:54 -05:00
|
|
|
int folio_wait_bit_killable(struct folio *folio, int bit_nr)
|
2016-12-25 13:00:30 +10:00
|
|
|
{
|
2021-03-04 12:02:54 -05:00
|
|
|
return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED);
|
2014-09-25 13:55:19 +10:00
|
|
|
}
|
2021-03-04 12:02:54 -05:00
|
|
|
EXPORT_SYMBOL(folio_wait_bit_killable);
|
2014-09-25 13:55:19 +10:00
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
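Why one extra reference is enough to defeat migration can be shown with a toy refcount model. This is a sketch under stated assumptions: the demo_ names are hypothetical, and the real check in the migration path works on the kernel's page reference count with atomic operations that are not reproduced here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a page with a plain, non-atomic refcount. */
struct demo_page {
	int refcount;
};

void demo_get_page(struct demo_page *page)
{
	page->refcount++;
}

void demo_put_page(struct demo_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/*
 * Migration only proceeds when nobody holds an unexpected reference;
 * otherwise it backs off and the caller has to retry.
 */
int demo_move_mapping(const struct demo_page *page, int expected_refs)
{
	return page->refcount == expected_refs ? 0 : -EAGAIN;
}
```

A faulting task that takes a reference and then sleeps on the page lock keeps the count above the expected value for the whole sleep. Dropping the reference before sleeping brings the count back to what migration expects.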
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 00:36:14 -08:00
|
|
|
/**
|
2021-08-16 23:36:31 -04:00
|
|
|
* folio_put_wait_locked - Drop a reference and wait for it to be unlocked
|
|
|
|
* @folio: The folio to wait for.
|
2021-02-24 12:02:02 -08:00
|
|
|
* @state: The sleep state (TASK_KILLABLE, TASK_UNINTERRUPTIBLE, etc).
|
2018-12-28 00:36:14 -08:00
|
|
|
*
|
2021-08-16 23:36:31 -04:00
|
|
|
* The caller should hold a reference on @folio. They expect the page to
|
2018-12-28 00:36:14 -08:00
|
|
|
* become unlocked relatively soon, but do not wish to hold up migration
|
2021-08-16 23:36:31 -04:00
|
|
|
* (for example) by holding the reference while waiting for the folio to
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
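The reference-count argument in this commit message can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions: `struct demo_page` and every `demo_*` function are hypothetical names invented here, the real wait and wakeup are not modeled, and the "expected reference count" check is a simplified stand-in for what the message says about migrate_page_move_mapping() failing with -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the fields this sketch needs. */
struct demo_page {
	int refcount;	/* references currently held */
	bool locked;	/* page lock held by the migrating task */
};

/*
 * Simplified model of the migration check: proceed only if the page
 * has exactly the references the migrating task expects.
 */
static int demo_move_mapping(const struct demo_page *page, int expected_refs)
{
	return page->refcount == expected_refs ? 0 : -EAGAIN;
}

/* Old scheme: the faulting task keeps its reference for the whole wait. */
static void demo_fault_wait_on_locked(struct demo_page *page)
{
	page->refcount++;	/* held until the page is unlocked */
}

/* New scheme: take a brief reference, then drop it before sleeping. */
static void demo_fault_put_and_wait(struct demo_page *page)
{
	page->refcount++;	/* brief reference taken by the faulting task */
	assert(page->refcount > 0);
	page->refcount--;	/* dropped before waiting; page is not touched again */
}
```

In this model, a faulter that keeps its reference makes `demo_move_mapping(&page, 1)` return -EAGAIN, while a faulter that drops it first lets the same call return 0.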
2018-12-28 00:36:14 -08:00
|
|
|
* come unlocked. After this function returns, the caller should not
|
2021-08-16 23:36:31 -04:00
|
|
|
* dereference @folio.
|
2021-02-24 12:02:02 -08:00
|
|
|
*
|
2021-08-16 23:36:31 -04:00
|
|
|
* Return: 0 if the folio was unlocked or -EINTR if interrupted by a signal.
|
2018-12-28 00:36:14 -08:00
|
|
|
*/
|
2022-09-14 10:17:38 +08:00
|
|
|
static int folio_put_wait_locked(struct folio *folio, int state)
|
2018-12-28 00:36:14 -08:00
|
|
|
{
|
2021-08-16 23:36:31 -04:00
|
|
|
return folio_wait_bit_common(folio, PG_locked, state, DROP);
|
2018-12-28 00:36:14 -08:00
|
|
|
}
|
|
|
|
|
2009-04-03 16:42:39 +01:00
|
|
|
/**
|
2021-01-16 11:22:14 -05:00
|
|
|
* folio_add_wait_queue - Add an arbitrary waiter to a folio's wait queue
|
|
|
|
* @folio: Folio defining the wait queue of interest
|
2009-04-13 14:39:54 -07:00
|
|
|
* @waiter: Waiter to add to the queue
|
2009-04-03 16:42:39 +01:00
|
|
|
*
|
2021-01-16 11:22:14 -05:00
|
|
|
* Add an arbitrary @waiter to the wait queue for the nominated @folio.
|
2009-04-03 16:42:39 +01:00
|
|
|
*/
|
2021-01-16 11:22:14 -05:00
|
|
|
void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter)
|
2009-04-03 16:42:39 +01:00
|
|
|
{
|
2021-01-16 11:22:14 -05:00
|
|
|
wait_queue_head_t *q = folio_waitqueue(folio);
|
2009-04-03 16:42:39 +01:00
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&q->lock, flags);
|
2017-08-28 16:45:40 -07:00
|
|
|
__add_wait_queue_entry_tail(q, waiter);
|
2021-01-16 11:22:14 -05:00
|
|
|
folio_set_waiters(folio);
|
2009-04-03 16:42:39 +01:00
|
|
|
spin_unlock_irqrestore(&q->lock, flags);
|
|
|
|
}
|
2021-01-16 11:22:14 -05:00
|
|
|
EXPORT_SYMBOL_GPL(folio_add_wait_queue);
|
2009-04-03 16:42:39 +01:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/**
|
2020-12-07 15:44:35 -05:00
|
|
|
* folio_unlock - Unlock a locked folio.
|
|
|
|
* @folio: The folio.
|
|
|
|
*
|
|
|
|
* Unlocks the folio and wakes up any thread sleeping on the page lock.
|
|
|
|
*
|
|
|
|
* Context: May be called from interrupt or process context. May not be
|
|
|
|
* called from NMI context.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2020-12-07 15:44:35 -05:00
|
|
|
void folio_unlock(struct folio *folio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2020-12-07 15:44:35 -05:00
|
|
|
/* Bit 7 allows x86 to check the byte's sign bit */
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hiccup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
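The "clear a bit in a byte, and test the sign bit" idea described above can be sketched in portable userspace C11. This is not the kernel primitive: `demo_clear_unlock_is_negative_byte()` and the `DEMO_*` bit positions are hypothetical names chosen for illustration, and C11 `atomic_fetch_and_explicit()` stands in for the architecture-specific implementation. The point it shows is that one atomic read-modify-write both clears the lock bit and reports bit 7, so the word is not accessed a second time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions; the waiters bit must be bit 7 of the byte. */
enum { DEMO_LOCKED = 0, DEMO_WAITERS = 7 };

/*
 * Clear the lock bit with release ordering and report whether bit 7
 * (the "sign bit" of the byte) was set, in one atomic operation.
 */
static bool demo_clear_unlock_is_negative_byte(atomic_uchar *flags)
{
	unsigned char old = atomic_fetch_and_explicit(flags,
			(unsigned char)~(1u << DEMO_LOCKED),
			memory_order_release);

	assert(old & (1u << DEMO_LOCKED));	/* caller must hold the lock */
	return (old & (1u << DEMO_WAITERS)) != 0;
}
```

A caller would wake the waiters only when the function returns true, and skip the wait queue entirely otherwise.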
2016-12-27 11:40:38 -08:00
|
|
|
BUILD_BUG_ON(PG_waiters != 7);
|
2020-12-07 15:44:35 -05:00
|
|
|
BUILD_BUG_ON(PG_locked > 7);
|
|
|
|
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
|
2023-10-04 17:53:15 +01:00
|
|
|
if (folio_xor_flags_has_waiters(folio, 1 << PG_locked))
|
2021-01-15 17:14:48 -05:00
|
|
|
folio_wake_bit(folio, PG_locked);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2020-12-07 15:44:35 -05:00
|
|
|
EXPORT_SYMBOL(folio_unlock);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-10-04 17:53:03 +01:00
|
|
|
/**
|
|
|
|
* folio_end_read - End read on a folio.
|
|
|
|
* @folio: The folio.
|
|
|
|
* @success: True if all reads completed successfully.
|
|
|
|
*
|
|
|
|
* When all reads against a folio have completed, filesystems should
|
|
|
|
* call this function to let the pagecache know that no more reads
|
|
|
|
* are outstanding. This will unlock the folio and wake up any thread
|
|
|
|
* sleeping on the lock. The folio will also be marked uptodate if all
|
|
|
|
* reads succeeded.
|
|
|
|
*
|
|
|
|
* Context: May be called from interrupt or process context. May not be
|
|
|
|
* called from NMI context.
|
|
|
|
*/
|
|
|
|
void folio_end_read(struct folio *folio, bool success)
|
|
|
|
{
|
2023-10-04 17:53:15 +01:00
|
|
|
unsigned long mask = 1 << PG_locked;
|
|
|
|
|
|
|
|
/* Must be in bottom byte for x86 to work */
|
|
|
|
BUILD_BUG_ON(PG_uptodate > 7);
|
|
|
|
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
|
|
|
|
VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
|
|
|
|
|
2023-10-04 17:53:03 +01:00
|
|
|
if (likely(success))
|
2023-10-04 17:53:15 +01:00
|
|
|
mask |= 1 << PG_uptodate;
|
|
|
|
if (folio_xor_flags_has_waiters(folio, mask))
|
|
|
|
folio_wake_bit(folio, PG_locked);
|
2023-10-04 17:53:03 +01:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(folio_end_read);
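The mask-and-xor pattern used by folio_end_read() can be sketched in userspace C11 as follows. Everything here is an assumption made for illustration: the `RD_*` bit positions are hypothetical (only "waiters is bit 7" comes from the source), and `demo_xor_flags_has_waiters()` is a stand-in built on `atomic_fetch_xor_explicit()`, not the kernel's folio_xor_flags_has_waiters(). Because the folio is known to be locked and not yet uptodate, xor-ing with the mask clears the lock bit and sets the uptodate bit in a single atomic step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions for this sketch. */
enum { RD_LOCKED = 0, RD_UPTODATE = 2, RD_WAITERS = 7 };

/* Flip the bits in mask and report whether the waiters bit was set. */
static bool demo_xor_flags_has_waiters(atomic_ulong *flags, unsigned long mask)
{
	unsigned long old = atomic_fetch_xor_explicit(flags, mask,
						      memory_order_release);

	return (old & (1ul << RD_WAITERS)) != 0;
}

/* Returns true if the caller should wake tasks sleeping on the lock. */
static bool demo_end_read(atomic_ulong *flags, bool success)
{
	unsigned long mask = 1ul << RD_LOCKED;
	unsigned long cur = atomic_load(flags);

	assert(cur & (1ul << RD_LOCKED));	/* must be locked */
	assert(!(cur & (1ul << RD_UPTODATE)));	/* must not be uptodate yet */

	if (success)
		mask |= 1ul << RD_UPTODATE;	/* xor sets it: it was clear */
	return demo_xor_flags_has_waiters(flags, mask);
}
```

The two assertions mirror the preconditions checked with VM_BUG_ON_FOLIO() in the function above; without them, xor could set the lock bit or clear the uptodate bit instead.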
|
|
|
|
|
2020-02-10 10:00:21 +00:00
|
|
|
/**
|
2021-04-22 22:58:32 -04:00
|
|
|
* folio_end_private_2 - Clear PG_private_2 and wake any waiters.
|
|
|
|
* @folio: The folio.
|
2020-02-10 10:00:21 +00:00
|
|
|
*
|
2021-04-22 22:58:32 -04:00
|
|
|
* Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
|
|
|
|
* it. The folio reference held for PG_private_2 being set is released.
|
2020-02-10 10:00:21 +00:00
|
|
|
*
|
2021-04-22 22:58:32 -04:00
|
|
|
* This is, for example, used when a netfs folio is being written to a local
|
|
|
|
* disk cache, thereby allowing writes to the cache for the same folio to be
|
2020-02-10 10:00:21 +00:00
|
|
|
* serialised.
|
|
|
|
*/
|
2021-04-22 22:58:32 -04:00
|
|
|
void folio_end_private_2(struct folio *folio)
|
2020-02-10 10:00:21 +00:00
|
|
|
{
|
2021-01-15 17:14:48 -05:00
|
|
|
VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
|
|
|
|
clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
|
|
|
|
folio_wake_bit(folio, PG_private_2);
|
|
|
|
folio_put(folio);
|
2020-02-10 10:00:21 +00:00
|
|
|
}
|
2021-04-22 22:58:32 -04:00
|
|
|
EXPORT_SYMBOL(folio_end_private_2);
|
2020-02-10 10:00:21 +00:00
|
|
|
|
|
|
|
/**
|
2021-04-22 22:58:32 -04:00
|
|
|
* folio_wait_private_2 - Wait for PG_private_2 to be cleared on a folio.
|
|
|
|
* @folio: The folio to wait on.
|
2020-02-10 10:00:21 +00:00
|
|
|
*
|
2024-03-19 11:13:26 +00:00
|
|
|
* Wait for PG_private_2 to be cleared on a folio.
|
2020-02-10 10:00:21 +00:00
|
|
|
*/
|
2021-04-22 22:58:32 -04:00
|
|
|
void folio_wait_private_2(struct folio *folio)
|
2020-02-10 10:00:21 +00:00
|
|
|
{
|
2021-03-04 12:02:54 -05:00
|
|
|
while (folio_test_private_2(folio))
|
|
|
|
folio_wait_bit(folio, PG_private_2);
|
2020-02-10 10:00:21 +00:00
|
|
|
}
|
2021-04-22 22:58:32 -04:00
|
|
|
EXPORT_SYMBOL(folio_wait_private_2);
|
2020-02-10 10:00:21 +00:00
|
|
|
|
|
|
|
/**
|
2021-04-22 22:58:32 -04:00
|
|
|
* folio_wait_private_2_killable - Wait for PG_private_2 to be cleared on a folio.
|
|
|
|
* @folio: The folio to wait on.
|
2020-02-10 10:00:21 +00:00
|
|
|
*
|
2024-03-19 11:13:26 +00:00
|
|
|
* Wait for PG_private_2 to be cleared on a folio or until a fatal signal is
|
|
|
|
* received by the calling task.
|
2020-02-10 10:00:21 +00:00
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* - 0 if successful.
|
|
|
|
* - -EINTR if a fatal signal was encountered.
|
|
|
|
*/
|
2021-04-22 22:58:32 -04:00
|
|
|
int folio_wait_private_2_killable(struct folio *folio)
|
2020-02-10 10:00:21 +00:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
2021-03-04 12:02:54 -05:00
|
|
|
while (folio_test_private_2(folio)) {
|
|
|
|
ret = folio_wait_bit_killable(folio, PG_private_2);
|
2020-02-10 10:00:21 +00:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2021-04-22 22:58:32 -04:00
|
|
|
EXPORT_SYMBOL(folio_wait_private_2_killable);
|
2020-02-10 10:00:21 +00:00
|
|
|
|
2006-06-23 02:03:49 -07:00
|
|
|
/**
|
2021-03-03 15:21:55 -05:00
|
|
|
* folio_end_writeback - End writeback against a folio.
|
|
|
|
* @folio: The folio.
|
2023-10-04 17:53:16 +01:00
|
|
|
*
|
|
|
|
* The folio must actually be under writeback.
|
|
|
|
*
|
|
|
|
* Context: May be called from process or interrupt context.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2021-03-03 15:21:55 -05:00
|
|
|
void folio_end_writeback(struct folio *folio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2023-10-04 17:53:16 +01:00
|
|
|
VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);
|
|
|
|
|
2014-06-04 16:10:34 -07:00
|
|
|
/*
|
2021-03-03 15:21:55 -05:00
|
|
|
* folio_test_clear_reclaim() could be used here but it is an
|
|
|
|
* atomic operation and overkill in this particular case. Failing
|
|
|
|
* to shuffle a folio marked for immediate reclaim is too mild
|
|
|
|
* a gain to justify taking an atomic operation penalty at the
|
|
|
|
* end of every folio writeback.
|
2014-06-04 16:10:34 -07:00
|
|
|
*/
|
2021-03-03 15:21:55 -05:00
|
|
|
if (folio_test_reclaim(folio)) {
|
|
|
|
folio_clear_reclaim(folio);
|
2020-12-08 01:25:39 -05:00
|
|
|
folio_rotate_reclaimable(folio);
|
2014-06-04 16:10:34 -07:00
|
|
|
}
|
2008-04-28 02:12:38 -07:00
|
|
|
|
mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)
Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.
The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.
https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
Then on crashing a second time, realized there's a stronger reason against
that approach. If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()? And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon? What
would that look like?
It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).
Matthew Wilcox explaining this to himself:
"page is allocated, added to page cache, dirtied, writeback starts,
--- thread A ---
filesystem calls end_page_writeback()
test_clear_page_writeback()
--- context switch to thread B ---
truncate_inode_pages_range() finds the page, it doesn't have writeback set,
we delete it from the page cache. Page gets reallocated, dirtied, writeback
starts again. Then we call write_cache_pages(), see
PageWriteback() set, call wait_on_page_writeback()
--- context switch back to thread A ---
wake_up_page(page, PG_writeback);
... thread B is woken, but because the wakeup was for the old use of
the page, PageWriteback is still set.
Devious"
And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.
I have not thought of a fix that does not add a little overhead: the
simplest fix is for end_page_writeback() to get_page() before calling
test_clear_page_writeback(), then put_page() after wake_up_page().
Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared? I think not,
because each waiter does hold a reference on the page. This bug comes
when the old use of the page, the one we do TestClearPageWriteback on,
had *no* waiters, so no additional page reference beyond the page cache
(and whoever racily freed it). The reuse of the page has a waiter
holding a reference, and its own PageWriteback set; but the belated
wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).
Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
Reported-by: Qian Cai <cai@lca.pw>
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-24 08:46:43 -08:00
|
|
|
/*
|
2021-03-03 15:21:55 -05:00
|
|
|
* Writeback does not hold a folio reference of its own, relying
|
2020-11-24 08:46:43 -08:00
|
|
|
* on truncation to wait for the clearing of PG_writeback.
|
2021-03-03 15:21:55 -05:00
|
|
|
* But here we must make sure that the folio is not freed and
|
2023-10-04 17:53:17 +01:00
|
|
|
* reused before the folio_wake_bit().
|
2020-11-24 08:46:43 -08:00
|
|
|
*/
|
2021-03-03 15:21:55 -05:00
|
|
|
folio_get(folio);
|
2023-10-04 17:53:17 +01:00
|
|
|
if (__folio_end_writeback(folio))
|
|
|
|
folio_wake_bit(folio, PG_writeback);
|
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
2021-11-06 14:08:17 -07:00
|
|
|
acct_reclaim_writeback(folio);
|
2021-03-03 15:21:55 -05:00
|
|
|
folio_put(folio);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2021-03-03 15:21:55 -05:00
|
|
|
EXPORT_SYMBOL(folio_end_writeback);
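The reason folio_end_writeback() takes its own reference before clearing PG_writeback can be illustrated with a single-threaded userspace model. This is only a sketch under stated assumptions: every `demo_*` name is hypothetical, the racing truncation is simulated by a direct call at the worst possible moment, and "freeing" is just a flag.

```c
#include <stdbool.h>

/* Hypothetical model of the folio fields that matter here. */
struct demo_folio {
	int refcount;
	bool writeback;
	bool freed;
	int wakeups;
};

static void demo_get(struct demo_folio *f)
{
	f->refcount++;
}

static void demo_put(struct demo_folio *f)
{
	if (--f->refcount == 0)
		f->freed = true;
}

/*
 * Models a racing truncation: once writeback is clear it drops the
 * page cache's reference, which may be the last one.
 */
static void demo_racing_truncate(struct demo_folio *f)
{
	if (!f->writeback)
		demo_put(f);
}

/* Returns true if the wakeup was issued against a folio that was still live. */
static bool demo_end_writeback(struct demo_folio *f, bool pin)
{
	bool live_at_wake;

	if (pin)
		demo_get(f);		/* models folio_get() */
	f->writeback = false;		/* models __folio_end_writeback() */
	demo_racing_truncate(f);	/* worst-case interleaving */
	live_at_wake = !f->freed;
	f->wakeups++;			/* models folio_wake_bit() */
	if (pin)
		demo_put(f);		/* models folio_put() */
	return live_at_wake;
}
```

With `pin` set, the folio survives until the final put; without it, the wakeup is issued against a folio that has already been released, which is the situation the comment in the function warns about.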
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2006-06-23 02:03:49 -07:00
|
|
|
/**
|
2021-03-01 19:38:25 -05:00
|
|
|
* __folio_lock - Get a lock on the folio, assuming we need to sleep to get it.
|
|
|
|
* @folio: The folio to lock
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2021-03-01 19:38:25 -05:00
|
|
|
void __folio_lock(struct folio *folio)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2021-03-04 12:02:54 -05:00
|
|
|
folio_wait_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE,
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 00:36:14 -08:00
|
|
|
EXCLUSIVE);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2021-03-01 19:38:25 -05:00
|
|
|
EXPORT_SYMBOL(__folio_lock);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2020-12-08 00:07:31 -05:00
|
|
|
int __folio_lock_killable(struct folio *folio)
|
2007-12-06 11:18:49 -05:00
|
|
|
{
|
2021-03-04 12:02:54 -05:00
|
|
|
return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE,
|
2018-12-28 00:36:14 -08:00
|
|
|
EXCLUSIVE);
|
2007-12-06 11:18:49 -05:00
|
|
|
}
|
2020-12-08 00:07:31 -05:00
|
|
|
EXPORT_SYMBOL_GPL(__folio_lock_killable);
|
2007-12-06 11:18:49 -05:00
|
|
|
|
2020-12-30 17:58:40 -05:00
|
|
|
static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
|
2020-05-22 09:12:09 -06:00
|
|
|
{
|
2021-01-16 11:22:14 -05:00
|
|
|
struct wait_queue_head *q = folio_waitqueue(folio);
|
2023-12-05 10:29:54 +08:00
|
|
|
int ret;
|
2021-02-24 12:02:09 -08:00
|
|
|
|
2021-01-16 11:22:14 -05:00
|
|
|
wait->folio = folio;
|
2021-02-24 12:02:09 -08:00
|
|
|
wait->bit_nr = PG_locked;
|
|
|
|
|
|
|
|
spin_lock_irq(&q->lock);
|
|
|
|
__add_wait_queue_entry_tail(q, &wait->wait);
|
2020-12-30 17:58:40 -05:00
|
|
|
folio_set_waiters(folio);
|
|
|
|
ret = !folio_trylock(folio);
|
2021-02-24 12:02:09 -08:00
|
|
|
/*
|
|
|
|
* If we were successful now, we know we're still on the
|
|
|
|
* waitqueue as we're still under the lock. This means it's
|
|
|
|
* safe to remove and return success, we know the callback
|
|
|
|
* isn't going to trigger.
|
|
|
|
*/
|
|
|
|
if (!ret)
|
|
|
|
__remove_wait_queue(q, &wait->wait);
|
|
|
|
else
|
|
|
|
ret = -EIOCBQUEUED;
|
|
|
|
spin_unlock_irq(&q->lock);
|
|
|
|
return ret;
|
2020-05-22 09:12:09 -06:00
|
|
|
}
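The queue-first, trylock-second ordering used by __folio_lock_async() can be sketched in userspace as follows. The names are hypothetical, the wait queue is reduced to a counter, and `DEMO_EIOCBQUEUED` is a placeholder for the kernel-internal error code. The real function does all of this under the wait queue's spinlock, which the sketch does not model.

```c
#include <stdbool.h>

#define DEMO_EIOCBQUEUED 529	/* placeholder for the kernel-internal code */

/* Hypothetical model: a lock bit, a waiters bit and a queue length. */
struct demo_folio {
	bool locked;
	bool has_waiters;
	int queued;
};

static bool demo_trylock(struct demo_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

static int demo_lock_async(struct demo_folio *f)
{
	int ret;

	/* Queue first, so an unlock between the two steps cannot be missed. */
	f->queued++;
	f->has_waiters = true;

	ret = !demo_trylock(f);
	if (!ret)
		f->queued--;			/* got the lock: dequeue again */
	else
		ret = -DEMO_EIOCBQUEUED;	/* stay queued, callback fires later */
	return ret;
}
```

Trying the lock before queueing would leave a window in which the holder unlocks and wakes an empty queue, so the caller would wait for a wakeup that never comes.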
|
|
|
|
|
2014-08-06 16:07:24 -07:00
|
|
|
/*
|
|
|
|
* Return values:
|
2023-06-30 14:19:55 -07:00
|
|
|
* 0 - folio is locked.
|
|
|
|
* non-zero - folio is not locked.
|
2023-06-30 14:19:56 -07:00
|
|
|
* mmap_lock or per-VMA lock has been released (mmap_read_unlock() or
|
|
|
|
* vma_end_read()), unless flags had both FAULT_FLAG_ALLOW_RETRY and
|
|
|
|
* FAULT_FLAG_RETRY_NOWAIT set, in which case the lock is still held.
|
2014-08-06 16:07:24 -07:00
|
|
|
*
|
2023-06-30 14:19:55 -07:00
|
|
|
* If neither ALLOW_RETRY nor KILLABLE is set, this always returns 0
|
2023-06-30 14:19:56 -07:00
|
|
|
* with the folio locked and the mmap_lock/per-VMA lock left unperturbed.
|
2014-08-06 16:07:24 -07:00
|
|
|
*/
|
2023-06-30 14:19:55 -07:00
|
|
|
vm_fault_t __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
|
2010-10-26 14:21:57 -07:00
|
|
|
{
|
2023-06-30 14:19:55 -07:00
|
|
|
unsigned int flags = vmf->flags;
|
|
|
|
|
2020-04-01 21:08:45 -07:00
|
|
|
if (fault_flag_allow_retry_first(flags)) {
|
2011-05-24 17:11:30 -07:00
|
|
|
/*
|
2023-06-30 14:19:56 -07:00
|
|
|
* CAUTION! In this case, the mmap_lock/per-VMA lock is not
|
|
|
|
* released even though VM_FAULT_RETRY is returned.
|
2011-05-24 17:11:30 -07:00
|
|
|
*/
|
|
|
|
if (flags & FAULT_FLAG_RETRY_NOWAIT)
|
2023-06-30 14:19:55 -07:00
|
|
|
return VM_FAULT_RETRY;
|
2011-05-24 17:11:30 -07:00
|
|
|
|
2023-06-30 14:19:56 -07:00
|
|
|
release_fault_lock(vmf);
|
2011-05-24 17:11:30 -07:00
|
|
|
if (flags & FAULT_FLAG_KILLABLE)
|
2021-03-04 10:21:02 -05:00
|
|
|
folio_wait_locked_killable(folio);
|
2011-05-24 17:11:30 -07:00
|
|
|
else
|
2021-03-04 10:21:02 -05:00
|
|
|
folio_wait_locked(folio);
|
2023-06-30 14:19:55 -07:00
|
|
|
return VM_FAULT_RETRY;
|
2020-12-14 19:05:02 -08:00
|
|
|
}
|
|
|
|
if (flags & FAULT_FLAG_KILLABLE) {
|
2021-03-18 21:39:45 -04:00
|
|
|
bool ret;
|
2011-05-24 17:11:30 -07:00
|
|
|
|
2020-12-08 00:07:31 -05:00
|
|
|
ret = __folio_lock_killable(folio);
|
2020-12-14 19:05:02 -08:00
|
|
|
if (ret) {
|
2023-06-30 14:19:56 -07:00
|
|
|
release_fault_lock(vmf);
|
2023-06-30 14:19:55 -07:00
|
|
|
return VM_FAULT_RETRY;
|
2020-12-14 19:05:02 -08:00
|
|
|
}
|
|
|
|
} else {
|
2020-12-08 00:07:31 -05:00
|
|
|
__folio_lock(folio);
|
2010-10-26 14:21:57 -07:00
|
|
|
}
|
2020-12-14 19:05:02 -08:00
|
|
|
|
2023-06-30 14:19:55 -07:00
|
|
|
return 0;
|
2010-10-26 14:21:57 -07:00
|
|
|
}
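The outcomes of __folio_lock_or_retry() can be summarised as a small decision function. This is a hedged userspace sketch rather than the kernel's code: the boolean parameters are hypothetical stand-ins for the result of fault_flag_allow_retry_first(), the FAULT_FLAG_RETRY_NOWAIT and FAULT_FLAG_KILLABLE bits, and a fatal signal arriving while the task waits for the lock.

```c
#include <stdbool.h>

/* What the caller observes after the call (hypothetical model). */
struct demo_outcome {
	bool retry;			/* VM_FAULT_RETRY returned */
	bool folio_locked;		/* caller now holds the folio lock */
	bool fault_lock_released;	/* mmap_lock/per-VMA lock was dropped */
};

static struct demo_outcome demo_lock_or_retry(bool allow_retry_first,
					      bool retry_nowait,
					      bool killable,
					      bool fatal_signal)
{
	struct demo_outcome out = { false, false, false };

	if (allow_retry_first) {
		/* Always a retry; the lock is kept only in the NOWAIT case. */
		out.retry = true;
		out.fault_lock_released = !retry_nowait;
		return out;
	}

	if (killable && fatal_signal) {
		/* The killable lock attempt was interrupted. */
		out.retry = true;
		out.fault_lock_released = true;
		return out;
	}

	/* Slept until the folio lock was acquired; fault lock untouched. */
	out.folio_locked = true;
	return out;
}
```

The only combination that reports a retry while still holding the mmap_lock/per-VMA lock is the NOWAIT one, which is why the function carries a CAUTION comment at that return.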
|
|
|
|
|
2014-04-03 14:47:44 -07:00
|
|
|
/**
|
2017-11-21 14:07:06 -05:00
|
|
|
* page_cache_next_miss() - Find the next gap in the page cache.
|
|
|
|
* @mapping: Mapping.
|
|
|
|
* @index: Index.
|
|
|
|
* @max_scan: Maximum range to search.
|
2014-04-03 14:47:44 -07:00
|
|
|
*
|
2017-11-21 14:07:06 -05:00
|
|
|
* Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the
|
|
|
|
* gap with the lowest index.
|
2014-04-03 14:47:44 -07:00
|
|
|
*
|
2017-11-21 14:07:06 -05:00
|
|
|
* This function may be called under the rcu_read_lock. However, this will
|
|
|
|
* not atomically search a snapshot of the cache at a single point in time.
|
|
|
|
* For example, if a gap is created at index 5, then subsequently a gap is
|
|
|
|
* created at index 10, page_cache_next_miss covering both indices may
|
|
|
|
* return 10 if called under the rcu_read_lock.
|
2014-04-03 14:47:44 -07:00
|
|
|
*
|
2017-11-21 14:07:06 -05:00
|
|
|
* Return: The index of the gap if found, otherwise an index outside the
|
|
|
|
* range specified (in which case 'return - index >= max_scan' will be true).
|
2023-06-21 14:24:02 -07:00
|
|
|
* In the rare case of index wrap-around, 0 will be returned.
|
2014-04-03 14:47:44 -07:00
|
|
|
*/
|
2017-11-21 14:07:06 -05:00
|
|
|
pgoff_t page_cache_next_miss(struct address_space *mapping,
|
2014-04-03 14:47:44 -07:00
|
|
|
pgoff_t index, unsigned long max_scan)
|
|
|
|
{
|
2017-11-21 14:07:06 -05:00
|
|
|
XA_STATE(xas, &mapping->i_pages, index);
|
2014-04-03 14:47:44 -07:00
|
|
|
|
2017-11-21 14:07:06 -05:00
|
|
|
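/* Count down a copy so that max_scan is still intact for the final return. */
unsigned long nr = max_scan;
while (nr--) {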
|
|
|
|
void *entry = xas_next(&xas);
|
|
|
|
if (!entry || xa_is_value(entry))
|
2024-06-25 12:18:52 +02:00
|
|
|
return xas.xa_index;
|
2023-06-21 14:24:02 -07:00
|
|
|
if (xas.xa_index == 0)
|
2024-06-25 12:18:52 +02:00
|
|
|
return 0;
|
2014-04-03 14:47:44 -07:00
|
|
|
}
|
|
|
|
|
2024-06-25 12:18:52 +02:00
|
|
|
return index + max_scan;
|
2014-04-03 14:47:44 -07:00
|
|
|
}
|
2017-11-21 14:07:06 -05:00
|
|
|
EXPORT_SYMBOL(page_cache_next_miss);
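The return contract documented for page_cache_next_miss() can be checked against a small userspace model. This is a sketch under stated assumptions: the page cache is replaced by a hypothetical `present[]` array, every index at or beyond `n` counts as a gap, and index wrap-around is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the documented contract: scan at most max_scan
 * slots starting at index and return the first gap. If there is none,
 * return index + max_scan, which lies outside the searched range.
 */
static unsigned long demo_next_miss(const bool *present, size_t n,
				    unsigned long index,
				    unsigned long max_scan)
{
	unsigned long nr = max_scan;	/* separate counter, max_scan stays intact */
	unsigned long i = index;

	while (nr--) {
		if (i >= n || !present[i])
			return i;	/* gap found inside the range */
		i++;
	}

	return index + max_scan;	/* no gap: one past the searched range */
}
```

A caller can tell the two cases apart exactly as the kerneldoc describes: the search failed when `ret - index >= max_scan`.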
|
2014-04-03 14:47:44 -07:00
|
|
|
|
|
|
|
/**
|
2019-05-13 17:21:29 -07:00
|
|
|
* page_cache_prev_miss() - Find the previous gap in the page cache.
|
2017-11-21 14:07:06 -05:00
|
|
|
* @mapping: Mapping.
|
|
|
|
* @index: Index.
|
|
|
|
* @max_scan: Maximum range to search.
|
2014-04-03 14:47:44 -07:00
|
|
|
*
|
2017-11-21 14:07:06 -05:00
|
|
|
* Search the range [max(index - max_scan + 1, 0), index] for the
|
|
|
|
* gap with the highest index.
|
2014-04-03 14:47:44 -07:00
|
|
|
*
|
2017-11-21 14:07:06 -05:00
|
|
|
* This function may be called under the rcu_read_lock. However, this will
|
|
|
|
* not atomically search a snapshot of the cache at a single point in time.
|
|
|
|
* For example, if a gap is created at index 10, then subsequently a gap is
|
|
|
|
* created at index 5, page_cache_prev_miss() covering both indices may
|
|
|
|
* return 5 if called under the rcu_read_lock.
|
2014-04-03 14:47:44 -07:00
|
|
|
*
|
2017-11-21 14:07:06 -05:00
|
|
|
* Return: The index of the gap if found, otherwise an index outside the
|
|
|
|
* range specified (in which case 'index - return >= max_scan' will be true).
|
2023-06-21 14:24:02 -07:00
|
|
|
* In the rare case of wrap-around, ULONG_MAX will be returned.
|
2014-04-03 14:47:44 -07:00
|
|
|
*/
|
2017-11-21 14:07:06 -05:00
|
|
|
pgoff_t page_cache_prev_miss(struct address_space *mapping,
|
2014-04-03 14:47:44 -07:00
|
|
|
pgoff_t index, unsigned long max_scan)
|
|
|
|
{
|
2017-11-21 14:07:06 -05:00
|
|
|
XA_STATE(xas, &mapping->i_pages, index);
|
2014-04-03 14:47:44 -07:00
|
|
|
|
2017-11-21 14:07:06 -05:00
|
|
|
while (max_scan--) {
|
|
|
|
void *entry = xas_prev(&xas);
|
|
|
|
if (!entry || xa_is_value(entry))
|
2023-06-21 14:24:02 -07:00
|
|
|
break;
|
|
|
|
if (xas.xa_index == ULONG_MAX)
|
|
|
|
break;
|
2014-04-03 14:47:44 -07:00
|
|
|
}
|
|
|
|
|
2023-06-21 14:24:02 -07:00
|
|
|
return xas.xa_index;
|
2014-04-03 14:47:44 -07:00
|
|
|
}
|
2017-11-21 14:07:06 -05:00
|
|
|
EXPORT_SYMBOL(page_cache_prev_miss);
|
2014-04-03 14:47:44 -07:00
|
|
|
|
2021-05-10 16:33:22 -04:00
|
|
|
/*
|
|
|
|
* Lockless page cache protocol:
|
|
|
|
* On the lookup side:
|
|
|
|
* 1. Load the folio from i_pages
|
|
|
|
* 2. Increment the refcount if it's not zero
|
|
|
|
* 3. If the folio is not found by xas_reload(), put the refcount and retry
|
|
|
|
*
|
|
|
|
* On the removal side:
|
|
|
|
* A. Freeze the page (by zeroing the refcount if nobody else has a reference)
|
|
|
|
* B. Remove the page from i_pages
|
|
|
|
* C. Return the page to the page allocator
|
|
|
|
*
|
|
|
|
* This means that any page may have its reference count temporarily
|
2024-04-02 14:55:16 +02:00
|
|
|
* increased by a speculative page cache (or GUP-fast) lookup as it can
|
2021-05-10 16:33:22 -04:00
|
|
|
* be allocated by another user before the RCU grace period expires.
|
|
|
|
* Because the refcount temporarily acquired here may end up being the
|
|
|
|
* last refcount on the page, any page allocation must be freeable by
|
|
|
|
* folio_put().
|
|
|
|
*/
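/*
 * Illustration only, not kernel code: the lookup steps 1-3 and the freeze
 * step A above can be sketched in user space with C11 atomics. Every
 * toy_* name is hypothetical, a single atomic pointer stands in for an
 * i_pages slot, and RCU is not modelled at all, so the sketch says nothing
 * about when the memory may actually be freed.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	atomic_int refcount;
};

/* One slot standing in for an entry in i_pages. */
static _Atomic(struct toy_folio *) toy_slot;

/* Step 2: take a reference only if the count is not already zero. */
static bool toy_try_get(struct toy_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_put(struct toy_folio *f)
{
	int old = atomic_fetch_sub(&f->refcount, 1);

	assert(old > 0);
}

/* Step A: freezing succeeds only if nobody else holds a reference. */
static bool toy_freeze(struct toy_folio *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Lookup side, steps 1-3. */
static struct toy_folio *toy_lookup(void)
{
	struct toy_folio *f;

	for (;;) {
		f = atomic_load(&toy_slot);		/* 1. load */
		if (!f)
			return NULL;
		if (!toy_try_get(f))			/* 2. get unless zero */
			continue;
		if (f == atomic_load(&toy_slot))	/* 3. recheck the slot */
			return f;
		toy_put(f);				/* slot changed: retry */
	}
}
```

/*
 * In this model a speculative reference makes toy_freeze() fail, and a
 * frozen folio makes toy_try_get() fail, which is the mutual exclusion
 * the protocol relies on.
 */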
|
|
|
|
|
2021-02-25 17:15:36 -08:00
|
|
|
/*
|
2023-03-07 15:34:05 +01:00
|
|
|
* filemap_get_entry - Get a page cache entry.
|
2006-06-23 02:03:49 -07:00
|
|
|
* @mapping: the address_space to search
|
2020-10-13 16:51:34 -07:00
|
|
|
* @index: The page cache index.
|
2014-04-03 14:47:46 -07:00
|
|
|
*
|
2020-12-15 23:22:38 -05:00
|
|
|
* Looks up the page cache entry at @mapping & @index. If it is a folio,
|
|
|
|
* it is returned with an increased refcount. If it is a shadow entry
|
|
|
|
* of a previously evicted folio, or a swap entry from shmem/tmpfs,
|
|
|
|
* it is returned without further action.
|
2006-06-23 02:03:49 -07:00
|
|
|
*
|
2020-12-15 23:22:38 -05:00
|
|
|
* Return: The folio, swap or shadow entry, %NULL if nothing is found.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2023-03-07 15:34:05 +01:00
|
|
|
void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2020-10-13 16:51:34 -07:00
|
|
|
XA_STATE(xas, &mapping->i_pages, index);
|
2020-12-15 23:22:38 -05:00
|
|
|
struct folio *folio;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2008-07-25 19:45:31 -07:00
|
|
|
rcu_read_lock();
|
|
|
|
repeat:
|
2018-05-16 16:12:50 -04:00
|
|
|
xas_reset(&xas);
|
2020-12-15 23:22:38 -05:00
|
|
|
folio = xas_load(&xas);
|
|
|
|
if (xas_retry(&xas, folio))
|
2018-05-16 16:12:50 -04:00
|
|
|
goto repeat;
|
|
|
|
/*
|
|
|
|
* A shadow entry of a recently evicted page, or a swap entry from
|
|
|
|
* shmem/tmpfs. Return it without attempting to raise page count.
|
|
|
|
*/
|
2020-12-15 23:22:38 -05:00
|
|
|
if (!folio || xa_is_value(folio))
|
2018-05-16 16:12:50 -04:00
|
|
|
goto out;
|
2016-07-26 15:26:04 -07:00
|
|
|
|
2024-06-25 13:53:50 -07:00
|
|
|
if (!folio_try_get(folio))
|
2018-05-16 16:12:50 -04:00
|
|
|
goto repeat;
|
2016-07-26 15:26:04 -07:00
|
|
|
|
2020-12-15 23:22:38 -05:00
|
|
|
if (unlikely(folio != xas_reload(&xas))) {
|
|
|
|
folio_put(folio);
|
2018-05-16 16:12:50 -04:00
|
|
|
goto repeat;
|
2008-07-25 19:45:31 -07:00
|
|
|
}
|
2010-11-11 14:05:19 -08:00
|
|
|
out:
|
2008-07-25 19:45:31 -07:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
2020-12-15 23:22:38 -05:00
|
|
|
return folio;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2014-04-03 14:47:46 -07:00
|
|
|
/**
|
2021-03-08 11:45:35 -05:00
|
|
|
* __filemap_get_folio - Find and get a reference to a folio.
|
2020-04-01 21:05:07 -07:00
|
|
|
* @mapping: The address_space to search.
|
|
|
|
* @index: The page index.
|
2021-03-08 11:45:35 -05:00
|
|
|
* @fgp_flags: %FGP flags modify how the folio is returned.
|
|
|
|
* @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
2020-04-01 21:05:07 -07:00
|
|
|
* Looks up the page cache entry at @mapping & @index.
|
2014-04-03 14:47:46 -07:00
|
|
|
*
|
2020-04-01 21:05:07 -07:00
|
|
|
* If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
|
|
|
|
* if the %GFP flags specified for %FGP_CREAT are atomic.
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
2023-05-26 16:43:23 -04:00
|
|
|
* If this function returns a folio, it is returned with an increased refcount.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
2023-03-07 15:34:10 +01:00
|
|
|
* Return: The found folio or an ERR_PTR() otherwise.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2021-03-08 11:45:35 -05:00
|
|
|
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
|
2023-05-26 16:43:23 -04:00
|
|
|
fgf_t fgp_flags, gfp_t gfp)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2021-03-08 11:45:35 -05:00
|
|
|
struct folio *folio;
|
2014-06-04 16:10:31 -07:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
repeat:
|
2023-03-07 15:34:05 +01:00
|
|
|
folio = filemap_get_entry(mapping, index);
|
2023-03-07 15:34:09 +01:00
|
|
|
if (xa_is_value(folio))
|
2021-03-08 11:45:35 -05:00
|
|
|
folio = NULL;
|
|
|
|
if (!folio)
|
2014-06-04 16:10:31 -07:00
|
|
|
goto no_page;
|
|
|
|
|
|
|
|
if (fgp_flags & FGP_LOCK) {
|
|
|
|
if (fgp_flags & FGP_NOWAIT) {
|
2021-03-08 11:45:35 -05:00
|
|
|
if (!folio_trylock(folio)) {
|
|
|
|
folio_put(folio);
|
2023-03-07 15:34:10 +01:00
|
|
|
return ERR_PTR(-EAGAIN);
|
2014-06-04 16:10:31 -07:00
|
|
|
}
|
|
|
|
} else {
|
2021-03-08 11:45:35 -05:00
|
|
|
folio_lock(folio);
|
2014-06-04 16:10:31 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Has the page been truncated? */
|
2021-03-08 11:45:35 -05:00
|
|
|
if (unlikely(folio->mapping != mapping)) {
|
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
2014-06-04 16:10:31 -07:00
|
|
|
goto repeat;
|
|
|
|
}
|
2021-03-08 11:45:35 -05:00
|
|
|
VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
|
2014-06-04 16:10:31 -07:00
|
|
|
}
|
|
|
|
|
2018-12-28 00:37:35 -08:00
|
|
|
if (fgp_flags & FGP_ACCESSED)
|
2021-03-08 11:45:35 -05:00
|
|
|
folio_mark_accessed(folio);
|
2020-08-06 23:19:55 -07:00
|
|
|
else if (fgp_flags & FGP_WRITE) {
|
|
|
|
/* Clear idle flag for buffer write */
|
2021-03-08 11:45:35 -05:00
|
|
|
if (folio_test_idle(folio))
|
|
|
|
folio_clear_idle(folio);
|
2020-08-06 23:19:55 -07:00
|
|
|
}
|
2014-06-04 16:10:31 -07:00
|
|
|
|
2020-12-24 12:55:56 -05:00
|
|
|
if (fgp_flags & FGP_STABLE)
|
|
|
|
folio_wait_stable(folio);
|
2014-06-04 16:10:31 -07:00
|
|
|
no_page:
|
2021-03-08 11:45:35 -05:00
|
|
|
if (!folio && (fgp_flags & FGP_CREAT)) {
|
2024-08-22 15:50:10 +02:00
|
|
|
unsigned int min_order = mapping_min_folio_order(mapping);
|
|
|
|
unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
|
2014-06-04 16:10:31 -07:00
|
|
|
int err;
|
2024-08-22 15:50:10 +02:00
|
|
|
index = mapping_align_index(mapping, index);
|
2023-05-19 16:10:37 -04:00
|
|
|
|
2020-09-24 08:51:40 +02:00
|
|
|
if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
|
2021-03-08 11:45:35 -05:00
|
|
|
gfp |= __GFP_WRITE;
|
2014-12-29 20:30:35 +01:00
|
|
|
if (fgp_flags & FGP_NOFS)
|
2021-03-08 11:45:35 -05:00
|
|
|
gfp &= ~__GFP_FS;
|
2022-07-01 14:04:43 -06:00
|
|
|
if (fgp_flags & FGP_NOWAIT) {
|
|
|
|
gfp &= ~GFP_KERNEL;
|
|
|
|
gfp |= GFP_NOWAIT | __GFP_NOWARN;
|
|
|
|
}
|
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 11:44:14 -07:00
|
|
|
if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
|
2014-06-04 16:10:31 -07:00
|
|
|
fgp_flags |= FGP_LOCK;
|
|
|
|
|
2024-08-22 15:50:09 +02:00
|
|
|
if (order > mapping_max_folio_order(mapping))
|
|
|
|
order = mapping_max_folio_order(mapping);
|
2023-05-19 16:10:37 -04:00
|
|
|
/* If we're not aligned, allocate a smaller folio */
|
|
|
|
if (index & ((1UL << order) - 1))
|
|
|
|
order = __ffs(index);
|
2014-06-04 16:10:31 -07:00
|
|
|
|
2023-05-19 16:10:37 -04:00
|
|
|
do {
|
|
|
|
gfp_t alloc_gfp = gfp;
|
|
|
|
|
|
|
|
err = -ENOMEM;
|
2024-08-22 15:50:10 +02:00
|
|
|
if (order > min_order)
|
2023-05-19 16:10:37 -04:00
|
|
|
alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
|
|
|
|
folio = filemap_alloc_folio(alloc_gfp, order);
|
|
|
|
if (!folio)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Init accessed to avoid an atomic mark_page_accessed() later */
|
|
|
|
if (fgp_flags & FGP_ACCESSED)
|
|
|
|
__folio_set_referenced(folio);
|
|
|
|
|
|
|
|
err = filemap_add_folio(mapping, folio, index, gfp);
|
|
|
|
if (!err)
|
|
|
|
break;
|
2021-03-08 11:45:35 -05:00
|
|
|
folio_put(folio);
|
|
|
|
folio = NULL;
|
2024-08-22 15:50:10 +02:00
|
|
|
} while (order-- > min_order);
|
2019-03-13 11:44:14 -07:00
|
|
|
|
2023-05-19 16:10:37 -04:00
|
|
|
if (err == -EEXIST)
|
|
|
|
goto repeat;
|
|
|
|
if (err)
|
|
|
|
return ERR_PTR(err);
|
2019-03-13 11:44:14 -07:00
|
|
|
/*
|
2021-03-08 11:45:35 -05:00
|
|
|
* filemap_add_folio locks the page, and for mmap
|
|
|
|
* we expect an unlocked page.
|
2019-03-13 11:44:14 -07:00
|
|
|
*/
|
2021-03-08 11:45:35 -05:00
|
|
|
if (folio && (fgp_flags & FGP_FOR_MMAP))
|
|
|
|
folio_unlock(folio);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2014-06-04 16:10:31 -07:00
|
|
|
|
2023-03-07 15:34:10 +01:00
|
|
|
if (!folio)
|
|
|
|
return ERR_PTR(-ENOENT);
|
2021-03-08 11:45:35 -05:00
|
|
|
return folio;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2021-03-08 11:45:35 -05:00
|
|
|
EXPORT_SYMBOL(__filemap_get_folio);
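/*
 * Illustration only, not kernel code: a user-space sketch of how the
 * FGP_CREAT path above settles on its first allocation order. The toy_*
 * names are hypothetical, toy_ffs() stands in for __ffs(), and the sketch
 * assumes the index has already been aligned to the minimum folio order.
 * The retry loop that steps the order down towards min_order on failure
 * is not modelled.
 */

```c
#include <assert.h>

/* Index of the lowest set bit; the caller guarantees word != 0. */
static unsigned int toy_ffs(unsigned long word)
{
	unsigned int bit = 0;

	assert(word != 0);
	while (!(word & 1UL)) {
		word >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Pick the first order to try for a folio placed at @index:
 * start from the larger of the requested and minimum orders, clamp to
 * the maximum, then shrink if @index is not aligned to that order.
 */
static unsigned int toy_initial_order(unsigned long index,
				      unsigned int requested,
				      unsigned int min_order,
				      unsigned int max_order)
{
	unsigned int order = requested > min_order ? requested : min_order;

	/* Assumed precondition: index is aligned to the minimum order. */
	assert((index & ((1UL << min_order) - 1)) == 0);

	if (order > max_order)
		order = max_order;
	/* If we're not aligned, fall back to the largest aligned order. */
	if (index & ((1UL << order) - 1))
		order = toy_ffs(index);
	return order;
}
```

/*
 * For example, a request for order 4 at index 20 drops to order 2,
 * because 20 is only aligned to a multiple of 4 pages.
 */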
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
|
2021-02-25 17:15:44 -08:00
|
|
|
xa_mark_t mark)
|
|
|
|
{
|
2020-12-17 00:12:26 -05:00
|
|
|
struct folio *folio;
|
2021-02-25 17:15:44 -08:00
|
|
|
|
|
|
|
retry:
|
|
|
|
if (mark == XA_PRESENT)
|
2020-12-17 00:12:26 -05:00
|
|
|
folio = xas_find(xas, max);
|
2021-02-25 17:15:44 -08:00
|
|
|
else
|
2020-12-17 00:12:26 -05:00
|
|
|
folio = xas_find_marked(xas, max, mark);
|
2021-02-25 17:15:44 -08:00
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
if (xas_retry(xas, folio))
|
2021-02-25 17:15:44 -08:00
|
|
|
goto retry;
|
|
|
|
/*
|
|
|
|
* A shadow entry of a recently evicted page, a swap
|
|
|
|
* entry from shmem/tmpfs or a DAX entry. Return it
|
|
|
|
* without attempting to raise page count.
|
|
|
|
*/
|
2020-12-17 00:12:26 -05:00
|
|
|
if (!folio || xa_is_value(folio))
|
|
|
|
return folio;
|
2021-02-25 17:15:44 -08:00
|
|
|
|
2024-06-25 13:53:50 -07:00
|
|
|
if (!folio_try_get(folio))
|
2021-02-25 17:15:44 -08:00
|
|
|
goto reset;
|
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
if (unlikely(folio != xas_reload(xas))) {
|
|
|
|
folio_put(folio);
|
2021-02-25 17:15:44 -08:00
|
|
|
goto reset;
|
|
|
|
}
|
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
return folio;
|
2021-02-25 17:15:44 -08:00
|
|
|
reset:
|
|
|
|
xas_reset(xas);
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
2014-04-03 14:47:46 -07:00
|
|
|
/**
|
|
|
|
* find_get_entries - gang pagecache lookup
|
|
|
|
* @mapping: The address_space to search
|
|
|
|
* @start: The starting page cache index
|
2021-02-25 17:16:00 -08:00
|
|
|
* @end: The final page index (inclusive).
|
2020-09-01 23:17:50 -04:00
|
|
|
* @fbatch: Where the resulting entries are placed.
|
2014-04-03 14:47:46 -07:00
|
|
|
* @indices: The cache indices corresponding to the entries in @fbatch
|
|
|
|
*
|
2021-02-25 17:16:11 -08:00
|
|
|
* find_get_entries() will search for and return a batch of entries in
|
2020-09-01 23:17:50 -04:00
|
|
|
* the mapping. The entries are placed in @fbatch. find_get_entries()
|
|
|
|
* takes a reference on any actual folios it returns.
|
2014-04-03 14:47:46 -07:00
|
|
|
*
|
2020-09-01 23:17:50 -04:00
|
|
|
* The entries have ascending indexes. The indices may not be consecutive
|
|
|
|
* due to not-present entries or large folios.
|
2014-04-03 14:47:46 -07:00
|
|
|
*
|
2020-09-01 23:17:50 -04:00
|
|
|
* Any shadow entries of evicted folios, or swap entries from
|
2014-05-06 12:50:05 -07:00
|
|
|
* shmem/tmpfs, are included in the returned array.
|
2014-04-03 14:47:46 -07:00
|
|
|
*
|
2020-09-01 23:17:50 -04:00
|
|
|
* Return: The number of entries which were found.
|
2014-04-03 14:47:46 -07:00
|
|
|
*/
|
2022-10-17 09:18:00 -07:00
|
|
|
unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
|
2020-09-01 23:17:50 -04:00
|
|
|
pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
|
2014-04-03 14:47:46 -07:00
|
|
|
{
|
2022-10-17 09:18:00 -07:00
|
|
|
XA_STATE(xas, &mapping->i_pages, *start);
|
2020-12-17 00:12:26 -05:00
|
|
|
struct folio *folio;
|
2014-04-03 14:47:46 -07:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2020-12-17 00:12:26 -05:00
|
|
|
while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) {
|
2020-09-01 23:17:50 -04:00
|
|
|
indices[fbatch->nr] = xas.xa_index;
|
|
|
|
if (!folio_batch_add(fbatch, folio))
|
2014-04-03 14:47:46 -07:00
|
|
|
break;
|
|
|
|
}
|
2021-02-25 17:16:11 -08:00
|
|
|
|
2022-10-17 09:18:00 -07:00
|
|
|
if (folio_batch_count(fbatch)) {
|
2024-08-12 15:42:05 +08:00
|
|
|
unsigned long nr;
|
2022-10-17 09:18:00 -07:00
|
|
|
int idx = folio_batch_count(fbatch) - 1;
|
|
|
|
|
|
|
|
folio = fbatch->folios[idx];
|
2023-09-26 12:20:17 -07:00
|
|
|
if (!xa_is_value(folio))
|
2022-10-17 09:18:00 -07:00
|
|
|
nr = folio_nr_pages(folio);
|
2024-08-12 15:42:05 +08:00
|
|
|
else
|
|
|
|
nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
|
|
|
|
*start = round_down(indices[idx] + nr, nr);
|
2022-10-17 09:18:00 -07:00
|
|
|
}
|
2024-08-12 15:42:05 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
2020-09-01 23:17:50 -04:00
|
|
|
return folio_batch_count(fbatch);
|
2014-04-03 14:47:46 -07:00
|
|
|
}
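/*
 * Illustration only, not kernel code: the update of *start at the end of
 * find_get_entries() can be read as "first index past the last entry
 * returned". A minimal sketch, assuming the entry covers a power-of-two
 * number of indices; toy_next_start() is a hypothetical name.
 */

```c
#include <assert.h>

/*
 * @found_index may be any index inside the last entry and @nr is the
 * number of indices that entry covers. Rounding found_index + nr down to
 * a multiple of nr gives the first index after the entry.
 */
static unsigned long toy_next_start(unsigned long found_index, unsigned long nr)
{
	assert(nr != 0 && (nr & (nr - 1)) == 0);	/* power of two */
	return (found_index + nr) & ~(nr - 1);
}
```

/*
 * A 4-page entry covering indices 4-7 gives 8 whether it was found at
 * index 4 or at index 5.
 */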
|
|
|
|
|
2021-02-25 17:15:56 -08:00
|
|
|
/**
|
|
|
|
* find_lock_entries - Find a batch of pagecache entries.
|
|
|
|
* @mapping: The address_space to search.
|
|
|
|
* @start: The starting page cache index.
|
|
|
|
* @end: The final page index (inclusive).
|
2021-12-07 14:15:07 -05:00
|
|
|
* @fbatch: Where the resulting entries are placed.
|
|
|
|
* @indices: The cache indices of the entries in @fbatch.
|
2021-02-25 17:15:56 -08:00
|
|
|
*
|
|
|
|
* find_lock_entries() will return a batch of entries from @mapping.
|
2020-12-17 00:12:26 -05:00
|
|
|
* Swap, shadow and DAX entries are included. Folios are returned
|
|
|
|
* locked and with an incremented refcount. Folios which are locked
|
|
|
|
* by somebody else or under writeback are skipped. Folios which are
|
|
|
|
* partially outside the range are not returned.
|
2021-02-25 17:15:56 -08:00
|
|
|
*
|
|
|
|
* The entries have ascending indexes. The indices may not be consecutive
|
2020-12-17 00:12:26 -05:00
|
|
|
* due to not-present entries, large folios, folios which could not be
|
|
|
|
* locked or folios under writeback.
|
2021-02-25 17:15:56 -08:00
|
|
|
*
|
|
|
|
* Return: The number of entries which were found.
|
|
|
|
*/
|
2022-10-17 09:17:59 -07:00
unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
		pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	struct folio *folio;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, end, XA_PRESENT))) {
		unsigned long base;
		unsigned long nr;

		if (!xa_is_value(folio)) {
			nr = folio_nr_pages(folio);
			base = folio->index;
			/* Omit large folio which begins before the start */
			if (base < *start)
				goto put;
			/* Omit large folio which extends beyond the end */
			if (base + nr - 1 > end)
				goto put;
			if (!folio_trylock(folio))
				goto put;
			if (folio->mapping != mapping ||
			    folio_test_writeback(folio))
				goto unlock;
			VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index),
					folio);
		} else {
			nr = 1 << xas_get_order(&xas);
			base = xas.xa_index & ~(nr - 1);
			/* Omit order>0 value which begins before the start */
			if (base < *start)
				continue;
			/* Omit order>0 value which extends beyond the end */
			if (base + nr - 1 > end)
				break;
		}

		/* Update start now so that last update is correct on return */
		*start = base + nr;
		indices[fbatch->nr] = xas.xa_index;
		if (!folio_batch_add(fbatch, folio))
			break;
		continue;
unlock:
		folio_unlock(folio);
put:
		folio_put(folio);
	}
	rcu_read_unlock();

	return folio_batch_count(fbatch);
}
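
The range checks in find_lock_entries() can be tried out in user space. The following is only a sketch under stated assumptions, not kernel code: `entry_base()` and `entry_fits()` are hypothetical helper names, `pgoff_t` is redefined locally as `unsigned long`, and the entry size `nr` is assumed to be a power of two, as it is for the order-based value entries above.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel's pgoff_t (assumption for this sketch). */
typedef unsigned long pgoff_t;

/*
 * First index covered by a multi-index entry of nr pages that contains
 * index, mirroring "xas.xa_index & ~(nr - 1)". nr must be a power of two.
 */
static inline pgoff_t entry_base(pgoff_t index, unsigned long nr)
{
	assert(nr != 0 && (nr & (nr - 1)) == 0);
	return index & ~(nr - 1);
}

/*
 * True if the entry [base, base + nr - 1] lies entirely inside
 * [start, end]. Entries that begin before start or extend beyond end
 * are the ones the loop above omits.
 */
static inline bool entry_fits(pgoff_t base, unsigned long nr,
			      pgoff_t start, pgoff_t end)
{
	if (base < start)
		return false;
	if (base + nr - 1 > end)
		return false;
	return true;
}
```

For example, an order-2 entry found at index 13 has base 12 and covers indices 12 to 15, so it fits the range [12, 15] but not [13, 20] or [0, 14].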

/**
 * filemap_get_folios - Get a batch of folios
 * @mapping: The address_space to search
 * @start: The starting page index
 * @end: The final page index (inclusive)
 * @fbatch: The batch to fill.
 *
 * Search for and return a batch of folios in the mapping starting at
 * index @start and up to index @end (inclusive). The folios are returned
 * in @fbatch with an elevated reference count.
 *
 * Return: The number of folios which were found.
 * We also update @start to index the next folio for the traversal.
 */
unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
		pgoff_t end, struct folio_batch *fbatch)
{
	return filemap_get_folios_tag(mapping, start, end, XA_PRESENT, fbatch);
}
EXPORT_SYMBOL(filemap_get_folios);

/**
 * filemap_get_folios_contig - Get a batch of contiguous folios
 * @mapping: The address_space to search
 * @start: The starting page index
 * @end: The final page index (inclusive)
 * @fbatch: The batch to fill
 *
 * filemap_get_folios_contig() works exactly like filemap_get_folios(),
 * except the returned folios are guaranteed to be contiguous. This may
 * not return all contiguous folios if the batch gets filled up.
 *
 * Return: The number of folios found.
 * Also update @start to be positioned for traversal of the next folio.
 */

unsigned filemap_get_folios_contig(struct address_space *mapping,
		pgoff_t *start, pgoff_t end, struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	unsigned long nr;
	struct folio *folio;

	rcu_read_lock();

	for (folio = xas_load(&xas); folio && xas.xa_index <= end;
			folio = xas_next(&xas)) {
		if (xas_retry(&xas, folio))
			continue;
		/*
		 * If the entry has been swapped out, we can stop looking.
		 * No current caller is looking for DAX entries.
		 */
		if (xa_is_value(folio))
			goto update_start;

		/* If we landed in the middle of a THP, continue at its end. */
		if (xa_is_sibling(folio))
			goto update_start;

		if (!folio_try_get(folio))
			goto retry;

		if (unlikely(folio != xas_reload(&xas)))
			goto put_folio;

		if (!folio_batch_add(fbatch, folio)) {
			nr = folio_nr_pages(folio);
			*start = folio->index + nr;
			goto out;
		}
		continue;
put_folio:
		folio_put(folio);

retry:
		xas_reset(&xas);
	}

update_start:
	nr = folio_batch_count(fbatch);

	if (nr) {
		folio = fbatch->folios[nr - 1];
		*start = folio_next_index(folio);
	}
out:
	rcu_read_unlock();
	return folio_batch_count(fbatch);
}
EXPORT_SYMBOL(filemap_get_folios_contig);

/**
 * filemap_get_folios_tag - Get a batch of folios matching @tag
 * @mapping: The address_space to search
 * @start: The starting page index
 * @end: The final page index (inclusive)
 * @tag: The tag index
 * @fbatch: The batch to fill
 *
 * The first folio may start before @start; if it does, it will contain
 * @start. The final folio may extend beyond @end; if it does, it will
 * contain @end. The folios have ascending indices. There may be gaps
 * between the folios if there are indices which have no folio in the
 * page cache. If folios are added to or removed from the page cache
 * while this is running, they may or may not be found by this call.
 * Only returns folios that are tagged with @tag.
 *
 * Return: The number of folios found.
 * Also update @start to index the next folio for traversal.
 */
unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start,
			pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, *start);
	struct folio *folio;

	rcu_read_lock();
	while ((folio = find_get_entry(&xas, end, tag)) != NULL) {
		/*
		 * Shadow entries should never be tagged, but this iteration
		 * is lockless so there is a window for page reclaim to evict
		 * a page we saw tagged. Skip over it.
		 */
		if (xa_is_value(folio))
			continue;
		if (!folio_batch_add(fbatch, folio)) {
			unsigned long nr = folio_nr_pages(folio);
			*start = folio->index + nr;
			goto out;
		}
	}
	/*
	 * We come here when there is no page beyond @end. We take care to not
	 * overflow the index @start as it confuses some of the callers. This
	 * breaks the iteration when there is a page at index -1 but that is
	 * already broken anyway.
	 */
	if (end == (pgoff_t)-1)
		*start = (pgoff_t)-1;
	else
		*start = end + 1;
out:
	rcu_read_unlock();

	return folio_batch_count(fbatch);
}
EXPORT_SYMBOL(filemap_get_folios_tag);
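
The way filemap_get_folios_tag() advances @start once the lookup has run past @end can be shown in isolation. This is a minimal user-space sketch, not the kernel's code: `next_start_after()` is a hypothetical name and `pgoff_t` is assumed to be `unsigned long`.

```c
/* Local stand-in for the kernel's pgoff_t (assumption for this sketch). */
typedef unsigned long pgoff_t;

/*
 * Next start index after a lookup that found nothing more up to end.
 * end + 1 would wrap to 0 when end is the largest index, so the result
 * saturates at (pgoff_t)-1 instead of overflowing.
 */
static inline pgoff_t next_start_after(pgoff_t end)
{
	if (end == (pgoff_t)-1)
		return (pgoff_t)-1;
	return end + 1;
}
```

Saturating keeps a caller's `start <= end` style loop condition from seeing @start wrap back to zero and restarting the scan from the beginning of the file.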
[PATCH] readahead: backoff on I/O error
Backoff readahead size exponentially on I/O error.
Michael Tokarev <mjt@tls.msk.ru> described the problem as:
[QUOTE]
Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
In order to "fix" it, one has to read it and write to another CD-rom,
or something.. or just ignore the error (if it's just a skip in a video
stream). Let's assume the unreadable block is number U.
But current behavior is just insane. An application requests block
number N, which is before U. Kernel tries to read-ahead blocks N..U.
Cdrom drive tries to read it, re-read it.. for some time. Finally,
when all the N..U-1 blocks are read, kernel returns block number N
(as requested) to an application, successfully.
Now an app requests block number N+1, and kernel tries to read
blocks N+1..U+1. Retrying again as in previous step.
And so on, up to when an app requests block number U-1. And when,
finally, it requests block U, it receives read error.
So, kernel currently tries to re-read the same failing block as
many times as the current readahead value (256 (times?) by default).
This whole process already killed my cdrom drive (I posted about it
to LKML several months ago) - literally, the drive has fried, and
does not work anymore. Of course that problem was a bug in firmware
(or whatever) of the drive *too*, but.. main problem with that is
current readahead logic as described above.
[/QUOTE]
Which was confirmed by Jens Axboe <axboe@suse.de>:
[QUOTE]
For ide-cd, it tends to only end the first part of the request on a
medium error. So you may see a lot of repeats :/
[/QUOTE]
With this patch, retries are expected to be reduced from, say, 256, to 5.
[akpm@osdl.org: cleanups]
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 05:48:43 -07:00
/*
 * CD/DVDs are error prone. When a medium error occurs, the driver may fail
 * a _large_ part of the i/o request. Imagine the worst scenario:
 *
 *      ---R__________________________________________B__________
 *         ^ reading here                             ^ bad block(assume 4k)
 *
 * read(R) => miss => readahead(R...B) => media error => frustrating retries
 * => failing the whole request => read(R) => read(R+1) =>
 * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
 * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
 * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
 *
 * It is going insane. Fix it by quickly scaling down the readahead size.
 */
static void shrink_readahead_size_eio(struct file_ra_state *ra)
{
	ra->ra_pages /= 4;
}
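
To see how quickly dividing by four collapses the readahead window, the arithmetic can be replayed in user space. This is a sketch under assumptions: `struct ra_state`, `shrink_on_eio()` and `errors_until_zero()` are hypothetical names, and only the `ra_pages` field is modelled.

```c
/* Hypothetical, simplified stand-in for struct file_ra_state. */
struct ra_state {
	unsigned int ra_pages;	/* maximum readahead window, in pages */
};

/* Same arithmetic as shrink_readahead_size_eio(): quarter the window. */
static inline void shrink_on_eio(struct ra_state *ra)
{
	ra->ra_pages /= 4;
}

/* Number of consecutive I/O errors needed to shrink the window to zero. */
static inline unsigned int errors_until_zero(unsigned int ra_pages)
{
	struct ra_state ra = { .ra_pages = ra_pages };
	unsigned int errors = 0;

	while (ra.ra_pages) {
		shrink_on_eio(&ra);
		errors++;
	}
	return errors;
}
```

Starting from 256 pages the window goes 64, 16, 4, 1, 0, so five errors are enough to switch readahead off, which is consistent with the reduction "from, say, 256, to 5" retries quoted in the commit message.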

/*
 * filemap_get_read_batch - Get a batch of folios for read
 *
 * Get a batch of folios which represent a contiguous range of bytes in
 * the file. No exceptional entries will be returned. If @index is in
 * the middle of a folio, the entire folio will be returned. The last
 * folio in the batch may have the readahead flag set or the uptodate flag
 * clear so that the caller can take the appropriate action.
 */
static void filemap_get_read_batch(struct address_space *mapping,
		pgoff_t index, pgoff_t max, struct folio_batch *fbatch)
{
	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
	for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
		if (xas_retry(&xas, folio))
			continue;
		if (xas.xa_index > max || xa_is_value(folio))
			break;
		if (xa_is_sibling(folio))
			break;
		if (!folio_try_get(folio))
			goto retry;

		if (unlikely(folio != xas_reload(&xas)))
			goto put_folio;

		if (!folio_batch_add(fbatch, folio))
			break;
		if (!folio_test_uptodate(folio))
			break;
		if (folio_test_readahead(folio))
			break;
		xas_advance(&xas, folio_next_index(folio) - 1);
		continue;
put_folio:
		folio_put(folio);
retry:
		xas_reset(&xas);
	}
	rcu_read_unlock();
}

static int filemap_read_folio(struct file *file, filler_t filler,
		struct folio *folio)
{
	bool workingset = folio_test_workingset(folio);
	unsigned long pflags;
	int error;

	/* Start the actual read. The read will unlock the page. */
	if (unlikely(workingset))
		psi_memstall_enter(&pflags);
	error = filler(file, folio);
	if (unlikely(workingset))
		psi_memstall_leave(&pflags);
	if (error)
		return error;

	error = folio_wait_locked_killable(folio);
	if (error)
		return error;
	if (folio_test_uptodate(folio))
		return 0;
	if (file)
		shrink_readahead_size_eio(&file->f_ra);
	return -EIO;
}

static bool filemap_range_uptodate(struct address_space *mapping,
		loff_t pos, size_t count, struct folio *folio,
		bool need_uptodate)
{
	if (folio_test_uptodate(folio))
		return true;
	/* pipes can't handle partially uptodate pages */
	if (need_uptodate)
		return false;
	if (!mapping->a_ops->is_partially_uptodate)
		return false;
	if (mapping->host->i_blkbits >= folio_shift(folio))
		return false;

	if (folio_pos(folio) > pos) {
		count -= folio_pos(folio) - pos;
		pos = 0;
	} else {
		pos -= folio_pos(folio);
	}

	return mapping->a_ops->is_partially_uptodate(folio, pos, count);
}
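
The last step of filemap_range_uptodate() converts a file-relative (pos, count) pair into an offset and length relative to the folio. A user-space sketch of that conversion follows. `struct folio_range` and `range_in_folio()` are hypothetical names, positions are plain `long long` values, and, like the code above, the sketch does not clamp the length to the end of the folio.

```c
#include <stddef.h>

/* Hypothetical byte range expressed relative to the start of a folio. */
struct folio_range {
	size_t offset;	/* first byte, relative to the folio */
	size_t count;	/* number of bytes from offset onwards */
};

/*
 * folio_pos is the file position of the folio's first byte; pos and
 * count describe the read in file coordinates.
 */
static inline struct folio_range range_in_folio(long long folio_pos,
						long long pos, size_t count)
{
	struct folio_range r;

	if (folio_pos > pos) {
		/* Read began in an earlier folio: skip the bytes before this one. */
		r.count = count - (size_t)(folio_pos - pos);
		r.offset = 0;
	} else {
		/* Read begins inside this folio. */
		r.offset = (size_t)(pos - folio_pos);
		r.count = count;
	}
	return r;
}
```

For a folio at file position 8192, a 10000-byte read starting at 4096 becomes offset 0 with 5904 bytes left, while a 100-byte read starting at 9000 becomes offset 808 with 100 bytes.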

static int filemap_update_page(struct kiocb *iocb,
		struct address_space *mapping, size_t count,
		struct folio *folio, bool need_uptodate)
{
	int error;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!filemap_invalidate_trylock_shared(mapping))
			return -EAGAIN;
	} else {
		filemap_invalidate_lock_shared(mapping);
	}

	if (!folio_trylock(folio)) {
		error = -EAGAIN;
		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
			goto unlock_mapping;
		if (!(iocb->ki_flags & IOCB_WAITQ)) {
			filemap_invalidate_unlock_shared(mapping);
			/*
			 * This is where we usually end up waiting for a
			 * previously submitted readahead to finish.
			 */
			folio_put_wait_locked(folio, TASK_KILLABLE);
			return AOP_TRUNCATED_PAGE;
		}
		error = __folio_lock_async(folio, iocb->ki_waitq);
		if (error)
			goto unlock_mapping;
	}

	error = AOP_TRUNCATED_PAGE;
	if (!folio->mapping)
		goto unlock;

	error = 0;
	if (filemap_range_uptodate(mapping, iocb->ki_pos, count, folio,
				   need_uptodate))
		goto unlock;

	error = -EAGAIN;
	if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ))
		goto unlock;

	error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio,
			folio);
	goto unlock_mapping;
unlock:
	folio_unlock(folio);
unlock_mapping:
	filemap_invalidate_unlock_shared(mapping);
	if (error == AOP_TRUNCATED_PAGE)
		folio_put(folio);
	return error;
}

static int filemap_create_folio(struct file *file,
		struct address_space *mapping, loff_t pos,
		struct folio_batch *fbatch)
{
	struct folio *folio;
	int error;
	unsigned int min_order = mapping_min_folio_order(mapping);
	pgoff_t index;

	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
	if (!folio)
		return -ENOMEM;

	/*
	 * Protect against truncate / hole punch. Grabbing invalidate_lock
	 * here assures we cannot instantiate and bring uptodate new
	 * pagecache folios after evicting page cache during truncate
	 * and before actually freeing blocks. Note that we could
	 * release invalidate_lock after inserting the folio into
	 * the page cache as the locked folio would then be enough to
	 * synchronize with hole punching. But there are code paths
	 * such as filemap_update_page() filling in partially uptodate
	 * pages or ->readahead() that need to hold invalidate_lock
	 * while mapping blocks for IO so let's hold the lock here as
	 * well to keep locking rules simple.
	 */
	filemap_invalidate_lock_shared(mapping);
	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
	error = filemap_add_folio(mapping, folio, index,
			mapping_gfp_constraint(mapping, GFP_KERNEL));
	if (error == -EEXIST)
		error = AOP_TRUNCATED_PAGE;
	if (error)
		goto error;

	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
	if (error)
		goto error;

	filemap_invalidate_unlock_shared(mapping);
	folio_batch_add(fbatch, folio);
	return 0;
error:
	filemap_invalidate_unlock_shared(mapping);
	folio_put(folio);
	return error;
}
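
The index computation in filemap_create_folio() rounds the page index of @pos down to a multiple of the minimum folio size. The user-space sketch below reproduces only that arithmetic. `min_order_index()` and `SKETCH_PAGE_SHIFT` are hypothetical names, and 4 KiB pages are an assumption of the sketch rather than something the code above requires.

```c
/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Local stand-in for the kernel's pgoff_t. */
typedef unsigned long pgoff_t;

/*
 * Page index at which a folio of order min_order covering byte
 * position pos must start: shift out the offset within the minimum
 * folio, then scale back up to a page index.
 */
static inline pgoff_t min_order_index(long long pos, unsigned int min_order)
{
	return (pgoff_t)(pos >> (SKETCH_PAGE_SHIFT + min_order)) << min_order;
}
```

With a minimum order of 2 (four pages per folio), byte position 0x5000 lies in page 5, and the new folio is inserted at index 4 so that it covers pages 4 to 7. With a minimum order of 0 the result is simply the page index, 5.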

static int filemap_readahead(struct kiocb *iocb, struct file *file,
		struct address_space *mapping, struct folio *folio,
		pgoff_t last_index)
{
	DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, folio->index);

	if (iocb->ki_flags & IOCB_NOIO)
		return -EAGAIN;
	page_cache_async_ra(&ractl, folio, last_index - folio->index);
	return 0;
}

static int filemap_get_pages(struct kiocb *iocb, size_t count,
		struct folio_batch *fbatch, bool need_uptodate)
{
	struct file *filp = iocb->ki_filp;
	struct address_space *mapping = filp->f_mapping;
	struct file_ra_state *ra = &filp->f_ra;
	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
	pgoff_t last_index;
	struct folio *folio;
mm: allow read-ahead with IOCB_NOWAIT set
Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9
("mm: allow read-ahead with IOCB_NOWAIT set"). However, this
implementation broke the semantics of IOCB_NOWAIT by potentially causing
it to wait on I/O during memory reclamation. This behavior was later
modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO").
To resolve the blocking issue during memory reclamation, we can use
memalloc_noio_{save,restore} to ensure non-blocking behavior. This change
restores the original functionality, allowing preadv2(IOCB_NOWAIT) to
trigger readahead if the file content is not present in the page cache.
While this process may trigger direct memory reclamation, the
__GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't
block.
A use case for this change is when we want to trigger readahead in the
preadv2(2) syscall if the file cache is absent, but without waiting for
certain filesystem locks, like xfs_ilock. A simple example is as follows:
  retry:
      if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
          do_other_work();
          goto retry;
      }
Link: https://lore.gnuweeb.org/io-uring/20200624164127.GP21350@casper.infradead.org/
Link: https://lkml.kernel.org/r/20240820022639.89562-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-20 10:26:39 +08:00
	unsigned int flags;
	int err = 0;

	/* "last_index" is the index of the page beyond the end of the read */
	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
retry:
	if (fatal_signal_pending(current))
		return -EINTR;

	filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	if (!folio_batch_count(fbatch)) {
		if (iocb->ki_flags & IOCB_NOIO)
			return -EAGAIN;
		if (iocb->ki_flags & IOCB_NOWAIT)
			flags = memalloc_noio_save();
		page_cache_sync_readahead(mapping, ra, filp, index,
				last_index - index);
		if (iocb->ki_flags & IOCB_NOWAIT)
			memalloc_noio_restore(flags);
		filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
	}
	if (!folio_batch_count(fbatch)) {
		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
			return -EAGAIN;
		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
		if (err == AOP_TRUNCATED_PAGE)
			goto retry;
		return err;
	}

	folio = fbatch->folios[folio_batch_count(fbatch) - 1];
	if (folio_test_readahead(folio)) {
		err = filemap_readahead(iocb, filp, mapping, folio, last_index);
		if (err)
			goto err;
	}
	if (!folio_test_uptodate(folio)) {
		if ((iocb->ki_flags & IOCB_WAITQ) &&
		    folio_batch_count(fbatch) > 1)
			iocb->ki_flags |= IOCB_NOWAIT;
		err = filemap_update_page(iocb, mapping, count, folio,
					  need_uptodate);
		if (err)
			goto err;
	}

	trace_mm_filemap_get_pages(mapping, index, last_index - 1);
	return 0;
err:
	if (err < 0)
		folio_put(folio);
	if (likely(--fbatch->nr))
		return 0;
	if (err == AOP_TRUNCATED_PAGE)
		goto retry;
	return err;
}
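
The page range that filemap_get_pages() asks for follows from the byte position and length of the read. The sketch below replays that arithmetic in user space. `first_index()`, `index_past_end()` and the `SKETCH_PAGE_*` macros are hypothetical names, and 4 KiB pages are assumed.

```c
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE (1ULL << SKETCH_PAGE_SHIFT)

/* Local stand-in for the kernel's pgoff_t. */
typedef unsigned long pgoff_t;

/* Index of the page containing the first byte of the read. */
static inline pgoff_t first_index(unsigned long long pos)
{
	return (pgoff_t)(pos >> SKETCH_PAGE_SHIFT);
}

/*
 * Index of the page just beyond the last byte of the read, i.e.
 * DIV_ROUND_UP(pos + count, PAGE_SIZE) written out by hand.
 */
static inline pgoff_t index_past_end(unsigned long long pos, size_t count)
{
	return (pgoff_t)((pos + count + SKETCH_PAGE_SIZE - 1) /
			 SKETCH_PAGE_SIZE);
}
```

A 2-byte read at position 4095 straddles a page boundary: the first index is 0 and the index past the end is 2, so the batch lookup covers pages 0 to 1, matching the `last_index - 1` upper bound passed to filemap_get_read_batch().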
|
|
|
|
|
2022-06-10 14:44:41 -04:00
|
|
|
static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
|
|
|
|
{
|
|
|
|
unsigned int shift = folio_shift(folio);
|
|
|
|
|
|
|
|
return (pos1 >> shift == pos2 >> shift);
|
|
|
|
}
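The shift comparison in pos_same_folio() can be tried out in user space. The following is a minimal sketch, not kernel code: `same_folio_sketch` and its `folio_shift_bits` parameter are hypothetical stand-ins for pos_same_folio() and folio_shift(folio), and `int64_t` is assumed in place of loff_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * User-space sketch of the comparison done by pos_same_folio().
 * folio_shift_bits is a hypothetical stand-in for folio_shift(folio),
 * i.e. log2 of the folio size in bytes (12 for a single 4 KiB page).
 * Two byte positions fall in the same folio when they agree in all
 * bits above the folio shift.
 */
static bool same_folio_sketch(int64_t pos1, int64_t pos2,
			      unsigned int folio_shift_bits)
{
	return (pos1 >> folio_shift_bits) == (pos2 >> folio_shift_bits);
}
```

With a shift of 12, positions 0x1000 and 0x1fff compare as the same folio while 0x1fff and 0x2000 do not; with a larger folio (shift 14) the latter pair does.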
|
|
|
|
|
2006-06-23 02:03:49 -07:00
|
|
|
/**
|
2021-02-24 12:02:42 -08:00
|
|
|
* filemap_read - Read data from the page cache.
|
|
|
|
* @iocb: The iocb to read.
|
|
|
|
* @iter: Destination for the data.
|
|
|
|
* @already_read: Number of bytes already read by the caller.
|
2006-06-23 02:03:49 -07:00
|
|
|
*
|
2021-02-24 12:02:42 -08:00
|
|
|
* Copies data from the page cache. If the data is not currently present,
|
2022-04-29 11:53:28 -04:00
|
|
|
* uses the readahead and read_folio address_space operations to fetch it.
|
2005-04-16 15:20:36 -07:00
|
|
|
*
|
2021-02-24 12:02:42 -08:00
|
|
|
* Return: Total number of bytes copied, including those already read by
|
|
|
|
* the caller. If an error happens before any bytes are copied, returns
|
|
|
|
* a negative error number.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2021-02-24 12:02:42 -08:00
|
|
|
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
|
|
|
|
ssize_t already_read)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2017-08-29 16:13:18 +02:00
|
|
|
struct file *filp = iocb->ki_filp;
|
2020-12-14 19:04:56 -08:00
|
|
|
struct file_ra_state *ra = &filp->f_ra;
|
2008-02-08 04:21:24 -08:00
|
|
|
struct address_space *mapping = filp->f_mapping;
|
2005-04-16 15:20:36 -07:00
|
|
|
struct inode *inode = mapping->host;
|
2021-12-06 15:25:33 -05:00
|
|
|
struct folio_batch fbatch;
|
2021-02-24 12:01:55 -08:00
|
|
|
int i, error = 0;
|
2020-12-14 19:04:56 -08:00
|
|
|
bool writably_mapped;
|
|
|
|
loff_t isize, end_offset;
|
mm/filemap.c: fix update prev_pos after one read request done
ra->prev_pos tracks the last byte visited by the previous read request.
ondemand_readahead uses it to decide whether a read is sequential, so it
affects the readahead window.
After commit 06c0444290ce ("mm/filemap.c: generic_file_buffered_read() now
uses find_get_pages_contig"), the update logic of prev_pos changed: it is
updated after each return from filemap_get_pages(), even though the user's
read request may not be fully completed at that point. The updated
prev_pos then affects the subsequent readahead window.
The real problem is a performance drop in fsck_msdos between linux-5.4 and
linux-5.15 (also linux-6.4). Compared to linux-5.4, it takes about 110% of
the time and reads 140% of the pages. The read pattern of fsck_msdos is not
fully sequential.
A simplified read pattern of fsck_msdos looks like this:
1.read at page offset 0xa,size 0x1000
2.read at other page offset like 0x20,size 0x1000
3.read at page offset 0xa,size 0x4000
4.read at page offset 0xe,size 0x1000
Here is the read status on linux-6.4:
1.after read at page offset 0xa,size 0x1000
->page ofs 0xa go into pagecache
2.after read at page offset 0x20,size 0x1000
->page ofs 0x20 go into pagecache
3.read at page offset 0xa,size 0x4000
->filemap_get_pages read ofs 0xa from pagecache and returns
->prev_pos is updated to 0xb and goto next loop
->filemap_get_pages tends to read ofs 0xb,size 0x3000
->initial_readahead case in ondemand_readahead since prev_pos is
the same as request ofs.
->read 8 pages while async size is 5 pages
(PageReadahead flag at page 0xe)
4.read at page offset 0xe,size 0x1000
->hit page 0xe with PageReadahead flag set,double the ra_size.
read 16 pages while async size is 16 pages
Now it reads 24 pages while actually uses 5 pages
on linux-5.4:
1.the same as 6.4
2.the same as 6.4
3.read at page offset 0xa,size 0x4000
->read ofs 0xa from pagecache
->read ofs 0xb,size 0x3000 using page_cache_sync_readahead
read 3 pages
->prev_pos is updated to 0xd before generic_file_buffered_read
returns
4.read at page offset 0xe,size 0x1000
->initial_readahead case in ondemand_readahead since
request ofs-prev_pos==1
->read 4 pages while async size is 3 pages
Now it reads 7 pages while actually uses 5 pages.
In the demo above, the initial_readahead case is triggered by the offset of
the user request on linux-5.4, whereas on linux-6.4 it may be triggered by
the update logic of prev_pos.
To fix the performance drop, update prev_pos only after one read request
has finished.
Link: https://lkml.kernel.org/r/20230628110220.120134-1-haibo.li@mediatek.com
Signed-off-by: Haibo Li <haibo.li@mediatek.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-28 19:02:20 +08:00
|
|
|
loff_t last_pos = ra->prev_pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2020-12-14 19:04:52 -08:00
|
|
|
if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
|
2016-12-14 12:45:25 -08:00
|
|
|
return 0;
|
2020-12-18 04:07:11 -05:00
|
|
|
if (unlikely(!iov_iter_count(iter)))
|
|
|
|
return 0;
|
|
|
|
|
2024-09-13 13:57:04 -04:00
|
|
|
iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos);
|
2021-12-06 15:25:33 -05:00
|
|
|
folio_batch_init(&fbatch);
|
vfs,mm: fix a dead loop in truncate_inode_pages_range()
We triggered a dead loop in truncate_inode_pages_range() on a 32-bit
architecture with the test case below:
...
fd = open();
write(fd, buf, 4096);
preadv64(fd, &iovec, 1, 0xffffffff000);
ftruncate(fd, 0);
...
Then ftruncate() never returns.
The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.
When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of ->mapping. Then
this page can be found in ->mapping with pagevec_lookup(). After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:
- find a page with index=0xffffffff; since index>=end, this page won't
be truncated
- index++, and index becomes 0
- the page with index=0xffffffff is found again
The data type of index is unsigned long, so on a 64-bit architecture index
does not wrap to 0 in this case and the dead loop does not happen.
Since truncate_inode_pages_range() runs while holding inode->i_rwsem, any
operation that needs this lock is blocked and a hung task results, e.g.:
INFO: task truncate_test:3364 blocked for more than 120 seconds.
...
call_rwsem_down_write_failed+0x17/0x30
generic_file_write_iter+0x32/0x1c0
ubifs_write_iter+0xcc/0x170
__vfs_write+0xc4/0x120
vfs_write+0xb2/0x1b0
SyS_write+0x46/0xa0
The page with index=0xffffffff added to ->mapping is useless. Fix this
by checking the read position before allocating pages.
Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 17:01:52 -07:00
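The wraparound described in the commit message above can be reproduced in user space. This is a minimal sketch under stated assumptions: `uint32_t` models the 32-bit `unsigned long` page index, a 4 KiB page size (shift of 12) is assumed, and the names `SKETCH_PAGE_SHIFT`, `page_index_sketch` and `next_index_sketch` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on most 32-bit configurations. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Page index for a byte offset, truncated to 32 bits the way an
 * unsigned long index would be on a 32-bit architecture.
 */
static uint32_t page_index_sketch(uint64_t pos)
{
	return (uint32_t)(pos >> SKETCH_PAGE_SHIFT);
}

/*
 * The "index++" step of the truncate loop. Unsigned arithmetic wraps,
 * so the largest index is followed by 0 and the scan starts over.
 */
static uint32_t next_index_sketch(uint32_t index)
{
	return index + 1;
}
```

The offset 0xffffffff000 from the test case maps to index 0xffffffff, and advancing that index yields 0, which is why the same page is found again.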
|
|
|
|
2020-12-14 19:04:56 -08:00
|
|
|
do {
|
2005-04-16 15:20:36 -07:00
|
|
|
cond_resched();
|
2017-02-03 13:13:29 -08:00
|
|
|
|
2020-12-14 19:04:52 -08:00
|
|
|
/*
|
2020-12-14 19:04:56 -08:00
|
|
|
* If we've already successfully copied some data, then we
|
|
|
|
* can no longer safely return -EIOCBQUEUED. Hence mark
|
|
|
|
* an async read NOWAIT at that point.
|
2020-12-14 19:04:52 -08:00
|
|
|
*/
|
2021-02-24 12:02:42 -08:00
|
|
|
if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
|
2020-12-14 19:04:52 -08:00
|
|
|
iocb->ki_flags |= IOCB_NOWAIT;
|
|
|
|
|
2021-11-05 13:36:49 -07:00
|
|
|
if (unlikely(iocb->ki_pos >= i_size_read(inode)))
|
|
|
|
break;
|
|
|
|
|
2023-05-22 14:50:17 +01:00
|
|
|
error = filemap_get_pages(iocb, iter->count, &fbatch, false);
|
2021-02-24 12:01:55 -08:00
|
|
|
if (error < 0)
|
2020-12-14 19:04:56 -08:00
|
|
|
break;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2020-12-14 19:04:56 -08:00
|
|
|
/*
|
|
|
|
* i_size must be checked after we know the pages are Uptodate.
|
|
|
|
*
|
|
|
|
* Checking i_size after the check allows us to calculate
|
|
|
|
* the correct value for "nr", which means the zero-filled
|
|
|
|
* part of the page is not copied back to userspace (unless
|
|
|
|
* another truncate extends the file - this is desired though).
|
|
|
|
*/
|
|
|
|
isize = i_size_read(inode);
|
|
|
|
if (unlikely(iocb->ki_pos >= isize))
|
2021-12-06 15:25:33 -05:00
|
|
|
goto put_folios;
|
2020-12-14 19:04:56 -08:00
|
|
|
end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Once we start copying data, we don't want to be touching any
|
|
|
|
* cachelines that might be contended:
|
|
|
|
*/
|
|
|
|
writably_mapped = mapping_writably_mapped(mapping);
|
|
|
|
|
|
|
|
/*
|
2022-06-10 14:44:41 -04:00
|
|
|
* When a read accesses the same folio several times, only
|
2020-12-14 19:04:56 -08:00
|
|
|
* mark it as accessed the first time.
|
|
|
|
*/
|
2023-06-28 19:02:20 +08:00
|
|
|
if (!pos_same_folio(iocb->ki_pos, last_pos - 1,
|
|
|
|
fbatch.folios[0]))
|
2021-12-06 15:25:33 -05:00
|
|
|
folio_mark_accessed(fbatch.folios[0]);
|
2020-12-14 19:04:56 -08:00
|
|
|
|
2021-12-06 15:25:33 -05:00
|
|
|
for (i = 0; i < folio_batch_count(&fbatch); i++) {
|
|
|
|
struct folio *folio = fbatch.folios[i];
|
2021-10-31 22:22:19 -04:00
|
|
|
size_t fsize = folio_size(folio);
|
|
|
|
size_t offset = iocb->ki_pos & (fsize - 1);
|
2021-02-24 12:01:59 -08:00
|
|
|
size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
|
2021-10-31 22:22:19 -04:00
|
|
|
fsize - offset);
|
2021-02-24 12:01:59 -08:00
|
|
|
size_t copied;
|
2020-12-14 19:04:56 -08:00
|
|
|
|
2021-10-31 22:22:19 -04:00
|
|
|
if (end_offset < folio_pos(folio))
|
2021-02-24 12:01:59 -08:00
|
|
|
break;
|
|
|
|
if (i > 0)
|
2021-10-31 22:22:19 -04:00
|
|
|
folio_mark_accessed(folio);
|
2020-12-14 19:04:56 -08:00
|
|
|
/*
|
2021-10-31 22:22:19 -04:00
|
|
|
* If users can be writing to this folio using arbitrary
|
|
|
|
* virtual addresses, take care of potential aliasing
|
|
|
|
* before reading the folio on the kernel side.
|
2020-12-14 19:04:56 -08:00
|
|
|
*/
|
2021-10-31 22:22:19 -04:00
|
|
|
if (writably_mapped)
|
|
|
|
flush_dcache_folio(folio);
|
2020-12-14 19:04:56 -08:00
|
|
|
|
2021-10-31 22:22:19 -04:00
|
|
|
copied = copy_folio_to_iter(folio, offset, bytes, iter);
|
2020-12-14 19:04:56 -08:00
|
|
|
|
2021-02-24 12:02:42 -08:00
|
|
|
already_read += copied;
|
2020-12-14 19:04:56 -08:00
|
|
|
iocb->ki_pos += copied;
|
2023-06-28 19:02:20 +08:00
|
|
|
last_pos = iocb->ki_pos;
|
2020-12-14 19:04:56 -08:00
|
|
|
|
|
|
|
if (copied < bytes) {
|
|
|
|
error = -EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2021-12-06 15:25:33 -05:00
|
|
|
put_folios:
|
|
|
|
for (i = 0; i < folio_batch_count(&fbatch); i++)
|
|
|
|
folio_put(fbatch.folios[i]);
|
|
|
|
folio_batch_init(&fbatch);
|
2020-12-14 19:04:56 -08:00
|
|
|
} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2008-10-15 22:01:13 -07:00
|
|
|
file_accessed(filp);
|
2023-06-28 19:02:20 +08:00
|
|
|
ra->prev_pos = last_pos;
|
2021-02-24 12:02:42 -08:00
|
|
|
return already_read ? already_read : error;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2021-02-24 12:02:42 -08:00
|
|
|
EXPORT_SYMBOL_GPL(filemap_read);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-06-01 16:58:56 +02:00
|
|
|
int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = iocb->ki_filp->f_mapping;
|
|
|
|
loff_t pos = iocb->ki_pos;
|
|
|
|
loff_t end = pos + count - 1;
|
|
|
|
|
|
|
|
if (iocb->ki_flags & IOCB_NOWAIT) {
|
|
|
|
if (filemap_range_needs_writeback(mapping, pos, end))
|
|
|
|
return -EAGAIN;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return filemap_write_and_wait_range(mapping, pos, end);
|
|
|
|
}
|
netfs: Implement unbuffered/DIO read support
Implement support for unbuffered and DIO reads in the netfs library,
utilising the existing read helper code to do block splitting and
individual queuing. The code also handles extraction of the destination
buffer from the supplied iterator, allowing async unbuffered reads to take
place.
The read will be split up according to the rsize setting and, if supplied,
the ->clamp_length() method. Note that the next subrequest will be issued
as soon as issue_op returns, without waiting for previous ones to finish.
The network filesystem needs to pause or handle queuing them if it doesn't
want to fire them all at the server simultaneously.
Once all the subrequests have finished, the state will be assessed and the
amount of data to be indicated as having being obtained will be
determined. As the subrequests may finish in any order, if an intermediate
subrequest is short, any further subrequests may be copied into the buffer
and then abandoned.
In the future, this will also take care of doing an unbuffered read from
encrypted content, with the decryption being done by the library.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2022-01-14 17:39:55 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kiocb_write_and_wait);
|
2023-06-01 16:58:56 +02:00
|
|
|
|
2024-09-11 17:34:39 +01:00
|
|
|
int filemap_invalidate_pages(struct address_space *mapping,
|
|
|
|
loff_t pos, loff_t end, bool nowait)
|
2023-06-01 16:58:57 +02:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2024-09-11 17:34:39 +01:00
|
|
|
if (nowait) {
|
2023-06-01 16:58:57 +02:00
|
|
|
/* we could block if there are any pages in the range */
|
|
|
|
if (filemap_range_has_page(mapping, pos, end))
|
|
|
|
return -EAGAIN;
|
|
|
|
} else {
|
|
|
|
ret = filemap_write_and_wait_range(mapping, pos, end);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After a write we want buffered reads to be sure to go to disk to get
|
|
|
|
* the new data. We invalidate clean cached page from the region we're
|
|
|
|
* about to write. We do this *before* the write so that we can return
|
|
|
|
* without clobbering -EIOCBQUEUED from ->direct_IO().
|
|
|
|
*/
|
|
|
|
return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
|
|
|
|
end >> PAGE_SHIFT);
|
|
|
|
}
|
2024-09-11 17:34:39 +01:00
|
|
|
|
|
|
|
int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = iocb->ki_filp->f_mapping;
|
|
|
|
|
|
|
|
return filemap_invalidate_pages(mapping, iocb->ki_pos,
|
|
|
|
iocb->ki_pos + count - 1,
|
|
|
|
iocb->ki_flags & IOCB_NOWAIT);
|
|
|
|
}
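The range arithmetic shared by kiocb_write_and_wait() and the invalidate helpers (an inclusive byte range `pos .. pos + count - 1`, then a shift by PAGE_SHIFT to get page indices) can be sketched in user space. This is an illustration only: `SKETCH_PAGE_SHIFT` assumes 4 KiB pages, and `struct page_range_sketch` and `byte_range_to_pages_sketch` are hypothetical names. The sketch assumes `count > 0`, matching generic_file_read_iter(), which returns early for zero-length reads.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Inclusive range of page indices covered by a byte range. */
struct page_range_sketch {
	uint64_t first;
	uint64_t last;
};

/*
 * Convert the byte range [pos, pos + count - 1] into page indices,
 * mirroring the "pos >> PAGE_SHIFT, end >> PAGE_SHIFT" arguments
 * passed to invalidate_inode_pages2_range(). Requires count > 0.
 */
static struct page_range_sketch byte_range_to_pages_sketch(uint64_t pos,
							   uint64_t count)
{
	uint64_t end = pos + count - 1;	/* last byte, inclusive */
	struct page_range_sketch r = {
		.first = pos >> SKETCH_PAGE_SHIFT,
		.last = end >> SKETCH_PAGE_SHIFT,
	};

	return r;
}
```

A two-byte range starting at offset 4095 straddles pages 0 and 1, while a full page starting at offset 4096 stays within page 1, because the end of the range is inclusive.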
|
2022-02-21 11:38:17 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kiocb_invalidate_pages);
|
2023-06-01 16:58:57 +02:00
|
|
|
|
2006-06-23 02:03:49 -07:00
|
|
|
/**
|
2014-04-04 14:20:57 -04:00
|
|
|
* generic_file_read_iter - generic filesystem read routine
|
2006-06-23 02:03:49 -07:00
|
|
|
* @iocb: kernel I/O control block
|
2014-04-04 14:20:57 -04:00
|
|
|
* @iter: destination for the data read
|
2006-06-23 02:03:49 -07:00
|
|
|
*
|
2014-04-04 14:20:57 -04:00
|
|
|
* This is the "read_iter()" routine for all filesystems
|
2005-04-16 15:20:36 -07:00
|
|
|
* that can use the page cache directly.
|
2019-11-21 23:25:07 +00:00
|
|
|
*
|
|
|
|
* The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall
|
|
|
|
* be returned when no data can be read without waiting for I/O requests
|
|
|
|
* to complete; it doesn't prevent readahead.
|
|
|
|
*
|
|
|
|
* The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O
|
|
|
|
* requests shall be made for the read or for readahead. When no data
|
|
|
|
* can be read, -EAGAIN shall be returned. When readahead would be
|
|
|
|
* triggered, a partial, possibly empty read shall be returned.
|
|
|
|
*
|
2019-03-05 15:48:42 -08:00
|
|
|
* Return:
|
|
|
|
* * number of bytes copied, even for partial reads
|
2019-11-21 23:25:07 +00:00
|
|
|
* * negative error code (or 0 if IOCB_NOIO) if nothing was read
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
|
|
|
ssize_t
|
2014-03-05 22:53:04 -05:00
|
|
|
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
If
- generic_file_read_iter() gets called with a zero read length,
- the read offset is at a page boundary,
- IOCB_DIRECT is not set
- and the page in question hasn't made it into the page cache yet,
then do_generic_file_read() will trigger a readahead with a req_size hint
of zero.
Since roundup_pow_of_two(0) is undefined, UBSAN reports
UBSAN: Undefined behaviour in include/linux/log2.h:63:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
[...]
Call Trace:
[...]
[<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
[<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
[<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
[<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
[<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
[<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
[...]
[<ffffffff81510b06>] __vfs_read+0x256/0x3d0
[...]
when get_init_ra_size() gets called from ondemand_readahead().
The net effect is that the initial readahead size is arch dependent for
requested read lengths of zero: for example, since
1UL << (sizeof(unsigned long) * 8)
evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
size becomes 4 on the former and 0 on the latter.
What's more, whether or not the file access timestamp is updated for zero
length reads is decided differently for the two cases of IOCB_DIRECT
being set or cleared: in the first case, generic_file_read_iter()
explicitly skips updating that timestamp while in the latter case, it is
always updated through the call to do_generic_file_read().
According to POSIX, zero length reads "do not modify the last data access
timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
Let generic_file_read_iter() unconditionally check the requested read
length at its entry and return immediately with success if it is zero.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 14:22:14 -07:00
|
|
|
size_t count = iov_iter_count(iter);
|
2017-08-29 16:13:18 +02:00
|
|
|
ssize_t retval = 0;
|
2016-03-25 14:22:14 -07:00
|
|
|
|
|
|
|
if (!count)
|
2021-02-24 12:02:45 -08:00
|
|
|
return 0; /* skip atime */
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2015-04-09 13:52:01 -04:00
|
|
|
if (iocb->ki_flags & IOCB_DIRECT) {
|
2017-08-29 16:13:18 +02:00
|
|
|
struct file *file = iocb->ki_filp;
|
2014-03-05 22:53:04 -05:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
struct inode *inode = mapping->host;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-06-01 16:58:56 +02:00
|
|
|
retval = kiocb_write_and_wait(iocb, count);
|
|
|
|
if (retval < 0)
|
|
|
|
return retval;
|
2016-10-03 09:48:08 +11:00
|
|
|
file_accessed(file);
|
|
|
|
|
2017-04-13 14:13:36 -04:00
|
|
|
retval = mapping->a_ops->direct_IO(iocb, iter);
|
2016-10-10 13:26:27 -04:00
|
|
|
if (retval >= 0) {
|
2016-04-07 08:51:55 -07:00
|
|
|
iocb->ki_pos += retval;
|
2017-04-13 14:13:36 -04:00
|
|
|
count -= retval;
|
Fix race when checking i_size on direct i/o read
So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but not sure that
there is one at the moment (don't see anything recent at kernel.org)
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?
Steve.
---------------------
Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.
Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:
a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue
b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.
c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.
The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.
There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
it's still an overall improvement.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24 14:42:22 +00:00
|
|
|
}
|
2021-02-24 12:01:45 -08:00
|
|
|
if (retval != -EIOCBQUEUED)
|
|
|
|
iov_iter_revert(iter, count - iov_iter_count(iter));
|
2010-05-23 11:00:54 -04:00
|
|
|
|
2014-01-24 14:42:22 +00:00
|
|
|
/*
|
|
|
|
* Btrfs can have a short DIO read if we encounter
|
|
|
|
* compressed extents, so if there was an error, or if
|
|
|
|
* we've already read everything we wanted to, or if
|
|
|
|
* there was a short read because we hit EOF, go ahead
|
|
|
|
* and return. Otherwise fallthrough to buffered io for
|
2015-02-16 15:58:53 -08:00
|
|
|
* the rest of the read. Buffered reads will not work for
|
|
|
|
* DAX files, so don't bother trying.
|
2014-01-24 14:42:22 +00:00
|
|
|
*/
|
2021-11-05 13:37:07 -07:00
|
|
|
if (retval < 0 || !count || IS_DAX(inode))
|
|
|
|
return retval;
|
|
|
|
if (iocb->ki_pos >= i_size_read(inode))
|
2021-02-24 12:02:45 -08:00
|
|
|
return retval;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2021-02-24 12:02:45 -08:00
|
|
|
return filemap_read(iocb, iter, retval);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2014-03-05 22:53:04 -05:00
|
|
|
EXPORT_SYMBOL(generic_file_read_iter);
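The comment near the end of generic_file_read_iter() lists the cases in which a direct read returns immediately instead of falling through to buffered I/O. A minimal user-space sketch of that decision follows. The names dio_next and after_direct_read are hypothetical and exist only for illustration; the real function works on a kiocb and an inode rather than plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of the check made after ->direct_IO returns. */
enum dio_next {
	DIO_RETURN,                /* hand retval back to the caller */
	DIO_FALL_BACK_TO_BUFFERED  /* read the remainder through the page cache */
};

/*
 * retval:    result of the direct read (negative on error)
 * remaining: bytes still wanted after the direct read
 * is_dax:    buffered reads do not work for DAX files
 * pos:       file position after the direct read
 * i_size:    current file size
 */
static enum dio_next after_direct_read(long retval, size_t remaining,
				       bool is_dax, int64_t pos,
				       int64_t i_size)
{
	assert(pos >= 0 && i_size >= 0);

	/* Error, everything already read, or DAX: nothing more to do. */
	if (retval < 0 || remaining == 0 || is_dax)
		return DIO_RETURN;
	/* Short read because we hit EOF. */
	if (pos >= i_size)
		return DIO_RETURN;
	/* Short read for another reason, e.g. a compressed extent. */
	return DIO_FALL_BACK_TO_BUFFERED;
}
```

Only the last case, a short read that stopped before EOF on a non-DAX file, reaches the buffered path.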
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-02-14 15:01:42 +00:00
|
|
|
/*
|
|
|
|
* Splice subpages from a folio into a pipe.
|
|
|
|
*/
|
|
|
|
size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
|
|
|
|
struct folio *folio, loff_t fpos, size_t size)
|
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
size_t spliced = 0, offset = offset_in_folio(folio, fpos);
|
|
|
|
|
|
|
|
page = folio_page(folio, offset / PAGE_SIZE);
|
|
|
|
size = min(size, folio_size(folio) - offset);
|
|
|
|
offset %= PAGE_SIZE;
|
|
|
|
|
|
|
|
while (spliced < size &&
|
|
|
|
!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
|
|
|
|
struct pipe_buffer *buf = pipe_head_buf(pipe);
|
|
|
|
size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
|
|
|
|
|
|
|
|
*buf = (struct pipe_buffer) {
|
|
|
|
.ops = &page_cache_pipe_buf_ops,
|
|
|
|
.page = page,
|
|
|
|
.offset = offset,
|
|
|
|
.len = part,
|
|
|
|
};
|
|
|
|
folio_get(folio);
|
|
|
|
pipe->head++;
|
|
|
|
page++;
|
|
|
|
spliced += part;
|
|
|
|
offset = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return spliced;
|
|
|
|
}
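The loop in splice_folio_into_pipe() cuts a byte range of a folio into per-page parts: the first part runs from the starting offset to the end of its page, and every later part starts at offset 0. The following is a rough user-space sketch of that arithmetic only. PAGE_SZ, struct part and split_into_parts() are hypothetical names, a 4 KiB page size is assumed, and no pipe or folio is involved.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumption: 4 KiB pages */

/* Hypothetical record standing in for one pipe buffer. */
struct part {
	size_t page_index;	/* which subpage of the folio */
	size_t offset;		/* offset within that subpage */
	size_t len;		/* bytes taken from that subpage */
};

/*
 * Split [offset, offset + size) of a folio into per-page parts, clamped to
 * the end of the folio and to max_parts entries (the "pipe full" case).
 * Returns the number of bytes covered and stores the part count in *nparts.
 */
static size_t split_into_parts(size_t folio_size, size_t offset, size_t size,
			       struct part *out, size_t max_parts,
			       size_t *nparts)
{
	size_t spliced = 0, n = 0;
	size_t page_index = offset / PAGE_SZ;

	assert(out != NULL && nparts != NULL);
	assert(offset < folio_size);

	if (size > folio_size - offset)
		size = folio_size - offset;
	offset %= PAGE_SZ;

	while (spliced < size && n < max_parts) {
		size_t part = PAGE_SZ - offset;

		if (part > size - spliced)
			part = size - spliced;
		out[n].page_index = page_index++;
		out[n].offset = offset;
		out[n].len = part;
		n++;
		spliced += part;
		offset = 0;	/* later parts start on a page boundary */
	}
	*nparts = n;
	return spliced;
}
```

As in the kernel function, the return value can be smaller than the requested size when the output runs out of slots, so a caller has to treat it as a short count.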
|
|
|
|
|
2023-05-22 14:50:18 +01:00
|
|
|
/**
|
|
|
|
* filemap_splice_read - Splice data from a file's pagecache into a pipe
|
|
|
|
* @in: The file to read from
|
|
|
|
* @ppos: Pointer to the file position to read from
|
|
|
|
* @pipe: The pipe to splice into
|
|
|
|
* @len: The amount to splice
|
|
|
|
* @flags: The SPLICE_F_* flags
|
|
|
|
*
|
|
|
|
* This function gets folios from a file's pagecache and splices them into the
|
|
|
|
* pipe. Readahead will be called as necessary to fill more folios. This may
|
|
|
|
* be used for blockdevs also.
|
|
|
|
*
|
|
|
|
* Return: On success, the number of bytes read will be returned and *@ppos
|
|
|
|
* will be updated if appropriate; 0 will be returned if there is no more data
|
|
|
|
* to be read; -EAGAIN will be returned if the pipe had no space, and some
|
|
|
|
* other negative error code will be returned on error. A short read may occur
|
|
|
|
* if the pipe has insufficient space, we reach the end of the data or we hit a
|
|
|
|
* hole.
|
2023-02-14 15:01:42 +00:00
|
|
|
*/
|
|
|
|
ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
|
|
|
|
struct pipe_inode_info *pipe,
|
|
|
|
size_t len, unsigned int flags)
|
|
|
|
{
|
|
|
|
struct folio_batch fbatch;
|
|
|
|
struct kiocb iocb;
|
|
|
|
size_t total_spliced = 0, used, npages;
|
|
|
|
loff_t isize, end_offset;
|
|
|
|
bool writably_mapped;
|
|
|
|
int i, error = 0;
|
|
|
|
|
2023-05-22 14:49:49 +01:00
|
|
|
if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
|
|
|
|
return 0;
|
|
|
|
|
2023-02-14 15:01:42 +00:00
|
|
|
init_sync_kiocb(&iocb, in);
|
|
|
|
iocb.ki_pos = *ppos;
|
|
|
|
|
|
|
|
/* Work out how much data we can actually add into the pipe */
|
|
|
|
used = pipe_occupancy(pipe->head, pipe->tail);
|
|
|
|
npages = max_t(ssize_t, pipe->max_usage - used, 0);
|
|
|
|
len = min_t(size_t, len, npages * PAGE_SIZE);
|
|
|
|
|
|
|
|
folio_batch_init(&fbatch);
|
|
|
|
|
|
|
|
do {
|
|
|
|
cond_resched();
|
|
|
|
|
2023-05-22 14:49:48 +01:00
|
|
|
if (*ppos >= i_size_read(in->f_mapping->host))
|
2023-02-14 15:01:42 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
iocb.ki_pos = *ppos;
|
|
|
|
error = filemap_get_pages(&iocb, len, &fbatch, true);
|
|
|
|
if (error < 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* i_size must be checked after we know the pages are Uptodate.
|
|
|
|
*
|
|
|
|
* Checking i_size after the uptodate check allows us to calculate
|
|
|
|
* the correct value for "nr", which means the zero-filled
|
|
|
|
* part of the page is not copied back to userspace (unless
|
|
|
|
* another truncate extends the file - this is desired though).
|
|
|
|
*/
|
2023-05-22 14:49:48 +01:00
|
|
|
isize = i_size_read(in->f_mapping->host);
|
2023-02-14 15:01:42 +00:00
|
|
|
if (unlikely(*ppos >= isize))
|
|
|
|
break;
|
|
|
|
end_offset = min_t(loff_t, isize, *ppos + len);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Once we start copying data, we don't want to be touching any
|
|
|
|
* cachelines that might be contended:
|
|
|
|
*/
|
|
|
|
writably_mapped = mapping_writably_mapped(in->f_mapping);
|
|
|
|
|
|
|
|
for (i = 0; i < folio_batch_count(&fbatch); i++) {
|
|
|
|
struct folio *folio = fbatch.folios[i];
|
|
|
|
size_t n;
|
|
|
|
|
|
|
|
if (folio_pos(folio) >= end_offset)
|
|
|
|
goto out;
|
|
|
|
folio_mark_accessed(folio);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If users can be writing to this folio using arbitrary
|
|
|
|
* virtual addresses, take care of potential aliasing
|
|
|
|
* before reading the folio on the kernel side.
|
|
|
|
*/
|
|
|
|
if (writably_mapped)
|
|
|
|
flush_dcache_folio(folio);
|
|
|
|
|
|
|
|
n = min_t(loff_t, len, isize - *ppos);
|
|
|
|
n = splice_folio_into_pipe(pipe, folio, *ppos, n);
|
|
|
|
if (!n)
|
|
|
|
goto out;
|
|
|
|
len -= n;
|
|
|
|
total_spliced += n;
|
|
|
|
*ppos += n;
|
|
|
|
in->f_ra.prev_pos = *ppos;
|
|
|
|
if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
folio_batch_release(&fbatch);
|
|
|
|
} while (len);
|
|
|
|
|
|
|
|
out:
|
|
|
|
folio_batch_release(&fbatch);
|
|
|
|
file_accessed(in);
|
|
|
|
|
|
|
|
return total_spliced ? total_spliced : error;
|
|
|
|
}
|
2023-02-15 08:00:31 +00:00
|
|
|
EXPORT_SYMBOL(filemap_splice_read);
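Before entering its loop, filemap_splice_read() works out how much data the pipe can still take and clamps len to that amount. A small sketch of the calculation follows, under the assumption that head and tail behave as free-running unsigned counters so that their difference is the occupancy. PAGE_SZ, ring_occupancy() and clamp_to_pipe_room() are hypothetical names for illustration, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumption: 4 KiB pages */

/*
 * Assumption: head and tail are free-running unsigned counters, so the
 * unsigned difference is the number of used slots even after wraparound.
 */
static unsigned int ring_occupancy(unsigned int head, unsigned int tail)
{
	return head - tail;
}

/* Clamp a requested length to the bytes that fit in the free slots. */
static size_t clamp_to_pipe_room(size_t len, unsigned int head,
				 unsigned int tail, unsigned int max_usage)
{
	unsigned int used = ring_occupancy(head, tail);
	size_t free_slots = used < max_usage ? max_usage - used : 0;
	size_t room = free_slots * PAGE_SZ;

	assert(max_usage > 0);
	return len < room ? len : room;
}
```

Each free slot is counted as one full page, which is an upper bound: a part that covers less than a page still consumes a whole slot.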
|
2023-02-14 15:01:42 +00:00
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
static inline loff_t folio_seek_hole_data(struct xa_state *xas,
|
|
|
|
struct address_space *mapping, struct folio *folio,
|
2021-02-25 17:15:52 -08:00
|
|
|
loff_t start, loff_t end, bool seek_data)
|
2021-02-25 17:15:48 -08:00
|
|
|
{
|
2021-02-25 17:15:52 -08:00
|
|
|
const struct address_space_operations *ops = mapping->a_ops;
|
|
|
|
size_t offset, bsz = i_blocksize(mapping->host);
|
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
if (xa_is_value(folio) || folio_test_uptodate(folio))
|
2021-02-25 17:15:52 -08:00
|
|
|
return seek_data ? start : end;
|
|
|
|
if (!ops->is_partially_uptodate)
|
|
|
|
return seek_data ? end : start;
|
|
|
|
|
|
|
|
xas_pause(xas);
|
|
|
|
rcu_read_unlock();
|
2020-12-17 00:12:26 -05:00
|
|
|
folio_lock(folio);
|
|
|
|
if (unlikely(folio->mapping != mapping))
|
2021-02-25 17:15:52 -08:00
|
|
|
goto unlock;
|
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
offset = offset_in_folio(folio, start) & ~(bsz - 1);
|
2021-02-25 17:15:52 -08:00
|
|
|
|
|
|
|
do {
|
2022-02-09 20:21:27 +00:00
|
|
|
if (ops->is_partially_uptodate(folio, offset, bsz) ==
|
2020-12-17 00:12:26 -05:00
|
|
|
seek_data)
|
2021-02-25 17:15:52 -08:00
|
|
|
break;
|
|
|
|
start = (start + bsz) & ~(bsz - 1);
|
|
|
|
offset += bsz;
|
2020-12-17 00:12:26 -05:00
|
|
|
} while (offset < folio_size(folio));
|
2021-02-25 17:15:52 -08:00
|
|
|
unlock:
|
2020-12-17 00:12:26 -05:00
|
|
|
folio_unlock(folio);
|
2021-02-25 17:15:52 -08:00
|
|
|
rcu_read_lock();
|
|
|
|
return start;
|
2021-02-25 17:15:48 -08:00
|
|
|
}
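When the filesystem provides ->is_partially_uptodate, folio_seek_hole_data() walks the folio one block at a time and stops at the first block whose state matches what is being sought. The sketch below models that walk in user space with a plain bool array in place of the callback. seek_in_blocks() and its parameters are hypothetical, and locking, RCU and the mapping check are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * uptodate[i] says whether block i of the folio holds valid data.
 * start is a byte offset inside the folio; bsz must be a power of two and
 * folio_size a multiple of bsz.
 *
 * Returns start unchanged if its own block already matches seek_data,
 * otherwise the start of the first later block that matches, or folio_size
 * if no block matches.
 */
static size_t seek_in_blocks(const bool *uptodate, size_t folio_size,
			     size_t bsz, size_t start, bool seek_data)
{
	size_t offset;

	assert(uptodate != NULL);
	assert(bsz != 0 && (bsz & (bsz - 1)) == 0);
	assert(start < folio_size && folio_size % bsz == 0);

	offset = start & ~(bsz - 1);	/* round down to the block start */
	do {
		if (uptodate[offset / bsz] == seek_data)
			break;
		start = (start + bsz) & ~(bsz - 1);	/* next block start */
		offset += bsz;
	} while (offset < folio_size);

	return start;
}
```

Note that start stays unaligned only while the very first block matches; once the walk moves on, every candidate position is block-aligned.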
|
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
static inline size_t seek_folio_size(struct xa_state *xas, struct folio *folio)
|
2021-02-25 17:15:48 -08:00
|
|
|
{
|
2020-12-17 00:12:26 -05:00
|
|
|
if (xa_is_value(folio))
|
2024-09-06 16:05:12 -07:00
|
|
|
return PAGE_SIZE << xas_get_order(xas);
|
2020-12-17 00:12:26 -05:00
|
|
|
return folio_size(folio);
|
2021-02-25 17:15:48 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* mapping_seek_hole_data - Seek for SEEK_DATA / SEEK_HOLE in the page cache.
|
|
|
|
* @mapping: Address space to search.
|
|
|
|
* @start: First byte to consider.
|
|
|
|
* @end: Limit of search (exclusive).
|
|
|
|
* @whence: Either SEEK_HOLE or SEEK_DATA.
|
|
|
|
*
|
|
|
|
* If the page cache knows which blocks contain holes and which blocks
|
|
|
|
* contain data, your filesystem can use this function to implement
|
|
|
|
* SEEK_HOLE and SEEK_DATA. This is useful for filesystems which are
|
|
|
|
* entirely memory-based such as tmpfs, and filesystems which support
|
|
|
|
* unwritten extents.
|
|
|
|
*
|
2021-05-06 18:06:47 -07:00
|
|
|
* Return: The requested offset on success, or -ENXIO if @whence specifies
|
2021-02-25 17:15:48 -08:00
|
|
|
* SEEK_DATA and there is no data after @start. There is an implicit hole
|
|
|
|
* after @end - 1, so SEEK_HOLE returns @end if all the bytes between @start
|
|
|
|
* and @end contain data.
|
|
|
|
*/
|
|
|
|
loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start,
|
|
|
|
loff_t end, int whence)
|
|
|
|
{
|
|
|
|
XA_STATE(xas, &mapping->i_pages, start >> PAGE_SHIFT);
|
mm/filemap: fix mapping_seek_hole_data on THP & 32-bit
No problem on 64-bit, or without huge pages, but xfstests generic/285
and other SEEK_HOLE/SEEK_DATA tests have regressed on huge tmpfs, and on
32-bit architectures, with the new mapping_seek_hole_data(). Several
different bugs turned out to need fixing.
u64 cast to stop losing bits when converting unsigned long to loff_t
(and let's use shifts throughout, rather than mixed with * and /).
Use round_up() when advancing pos, to stop assuming that pos was already
THP-aligned when advancing it by THP-size. (This use of round_up()
assumes that any THP has THP-aligned index: true at present and true
going forward, but could be recoded to avoid the assumption.)
Use xas_set() when iterating away from a THP, so that xa_index stays in
synch with start, instead of drifting away to return bogus offset.
Check start against end to avoid wrapping 32-bit xa_index to 0 (and to
handle these additional cases, seek_data or not, it's easier to break
the loop than goto: so rearrange exit from the function).
[hughd@google.com: remove unneeded u64 casts, per Matthew]
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104221347240.1170@eggly.anvils
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104211737410.3299@eggly.anvils
Fixes: 41139aa4c3a3 ("mm/filemap: add mapping_seek_hole_data")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-23 14:29:00 -07:00
|
|
|
pgoff_t max = (end - 1) >> PAGE_SHIFT;
|
2021-02-25 17:15:48 -08:00
|
|
|
bool seek_data = (whence == SEEK_DATA);
|
2020-12-17 00:12:26 -05:00
|
|
|
struct folio *folio;
|
2021-02-25 17:15:48 -08:00
|
|
|
|
|
|
|
if (end <= start)
|
|
|
|
return -ENXIO;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2020-12-17 00:12:26 -05:00
|
|
|
while ((folio = find_get_entry(&xas, max, XA_PRESENT))) {
|
2021-04-23 14:29:00 -07:00
|
|
|
loff_t pos = (u64)xas.xa_index << PAGE_SHIFT;
|
2020-12-17 00:12:26 -05:00
|
|
|
size_t seek_size;
|
2021-02-25 17:15:48 -08:00
|
|
|
|
|
|
|
if (start < pos) {
|
|
|
|
if (!seek_data)
|
|
|
|
goto unlock;
|
|
|
|
start = pos;
|
|
|
|
}
|
|
|
|
|
2020-12-17 00:12:26 -05:00
|
|
|
seek_size = seek_folio_size(&xas, folio);
|
|
|
|
pos = round_up((u64)pos + 1, seek_size);
|
|
|
|
start = folio_seek_hole_data(&xas, mapping, folio, start, pos,
|
2021-02-25 17:15:52 -08:00
|
|
|
seek_data);
|
|
|
|
if (start < pos)
|
2021-02-25 17:15:48 -08:00
|
|
|
goto unlock;
|
2021-04-23 14:29:00 -07:00
|
|
|
if (start >= end)
|
|
|
|
break;
|
|
|
|
if (seek_size > PAGE_SIZE)
|
|
|
|
xas_set(&xas, pos >> PAGE_SHIFT);
|
2020-12-17 00:12:26 -05:00
|
|
|
if (!xa_is_value(folio))
|
|
|
|
folio_put(folio);
|
2021-02-25 17:15:48 -08:00
|
|
|
}
|
|
|
|
if (seek_data)
|
2021-04-23 14:29:00 -07:00
|
|
|
start = -ENXIO;
|
2021-02-25 17:15:48 -08:00
|
|
|
unlock:
|
|
|
|
rcu_read_unlock();
|
2020-12-17 00:12:26 -05:00
|
|
|
if (folio && !xa_is_value(folio))
|
|
|
|
folio_put(folio);
|
2021-02-25 17:15:48 -08:00
|
|
|
if (start > end)
|
|
|
|
return end;
|
|
|
|
return start;
|
|
|
|
}
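Two of the fixes described in the "fix mapping_seek_hole_data on THP & 32-bit" commit message are plain arithmetic: rounding up to the end of the current entry instead of adding its size to a possibly unaligned position, and widening the page index before shifting. The sketch below works both through in user space. round_up_pow2(), next_entry_pos() and index_to_pos() are hypothetical helpers, and a 32-bit index type is assumed in order to mimic a 32-bit architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of size; size must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (x + size - 1) & ~(size - 1);
}

/*
 * Position just past the entry that contains pos, for an entry of
 * seek_size bytes whose start is seek_size-aligned. Adding seek_size to
 * pos would overshoot whenever pos sits in the middle of the entry.
 */
static uint64_t next_entry_pos(uint64_t pos, uint64_t seek_size)
{
	return round_up_pow2(pos + 1, seek_size);
}

/*
 * Convert a page index to a byte offset. The cast comes before the shift
 * so that the high bits survive when the index type is only 32 bits wide.
 */
static uint64_t index_to_pos(uint32_t index, unsigned int page_shift)
{
	return (uint64_t)index << page_shift;
}
```

For example, with a 2 MiB entry and pos one page into it (0x201000), next_entry_pos() yields 0x400000, whereas pos plus the entry size would give 0x401000 and skip the first page of the following entry.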
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
#define MMAP_LOTSAMISS (100)
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled if there is IO
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in its
core web server, because some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trips over processes' mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 11:44:22 -07:00
|
|
|
/*
|
2021-03-10 10:46:41 -05:00
|
|
|
* lock_folio_maybe_drop_mmap - lock the folio, possibly dropping the mmap_lock
|
2019-03-13 11:44:22 -07:00
|
|
|
* @vmf - the vm_fault for this fault.
|
2021-03-10 10:46:41 -05:00
|
|
|
* @folio - the folio to lock.
|
2019-03-13 11:44:22 -07:00
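The locking pattern described in the commit message above can be modelled outside the kernel. What follows is a minimal user-space sketch under stated assumptions, not the kernel implementation: every `sketch_*` name and `SKETCH_*` flag is hypothetical, the page lock is modelled by a C11 `atomic_flag`, and the mmap_sem is reduced to a boolean recording whether the caller still holds it. It shows only the ordering: try the inner lock first, refuse to wait when the caller asked not to, and otherwise give up the outer lock before blocking.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags, loosely modelled on the FAULT_FLAG_* names. */
enum {
	SKETCH_ALLOW_RETRY  = 1u << 0,	/* caller can cope with a retry */
	SKETCH_RETRY_NOWAIT = 1u << 1,	/* caller must not block */
};

/* Hypothetical stand-in for the fault state. */
struct sketch_fault {
	unsigned int flags;
	bool outer_held;	/* models "mmap_sem is still held" */
};

/* Returns true if the inner lock was taken without waiting. */
static bool sketch_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void sketch_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Returns 1 with the inner lock held, or 0 if it could not be taken.
 * If the outer lock had to be dropped, f->outer_held is cleared so the
 * caller knows it has to report a retry.
 */
static int sketch_lock_maybe_drop_outer(struct sketch_fault *f,
					atomic_flag *inner)
{
	/* Fast path: no contention, nothing is dropped. */
	if (sketch_trylock(inner))
		return 1;

	/* The caller asked not to wait: fail with the outer lock kept. */
	if (f->flags & SKETCH_RETRY_NOWAIT)
		return 0;

	/* About to block: release the outer lock first if that is allowed. */
	if (f->flags & SKETCH_ALLOW_RETRY)
		f->outer_held = false;

	/* Busy-wait keeps the sketch short; real code would sleep here. */
	while (!sketch_trylock(inner))
		;
	return 1;
}
```

A caller that finds `outer_held` cleared after a successful lock would finish its blocking work and then report a retry, which is the role VM_FAULT_RETRY plays in the commit message.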
|
|
|
* @fpin - the pointer to the file we may pin (or is already pinned).
|
|
|
|
*
|
2021-03-10 10:46:41 -05:00
|
|
|
* This works similar to lock_folio_or_retry in that it can drop the
|
|
|
|
* mmap_lock. It differs in that it actually returns the folio locked
|
|
|
|
* if it returns 1 and 0 if it couldn't lock the folio. If we did have
|
|
|
|
* to drop the mmap_lock then fpin will point to the pinned file and
|
|
|
|
* needs to be fput()'ed at a later point.
|
2019-03-13 11:44:22 -07:00
|
|
|
*/
|
2021-03-10 10:46:41 -05:00
|
|
|
static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio,
|
2019-03-13 11:44:22 -07:00
|
|
|
struct file **fpin)
|
|
|
|
{
|
2021-03-01 19:38:25 -05:00
|
|
|
if (folio_trylock(folio))
|
2019-03-13 11:44:22 -07:00
|
|
|
return 1;
|
|
|
|
|
2019-03-15 11:26:07 -07:00
|
|
|
/*
|
|
|
|
* NOTE! This will make us return with VM_FAULT_RETRY, but with
|
mm: make lock_folio_maybe_drop_mmap() VMA lock aware
Patch series "Handle more faults under the VMA lock", v2.
At this point, we're handling the majority of file-backed page faults
under the VMA lock, using the ->map_pages entry point. This patch set
attempts to expand that for the following situations:
- We have to do a read. This could be because we've hit the point in
the readahead window where we need to kick off the next readahead,
or because the page is simply not present in cache.
- We're handling a write fault. Most applications don't do I/O by writes
to shared mmaps for very good reasons, but some do, and it'd be nice
to not make that slow unnecessarily.
- We're doing a COW of a private mapping (both PTE already present
and PTE not-present). These are two different codepaths and I handle
both of them in this patch set.
There is no support in this patch set for drivers to mark themselves as
being VMA lock friendly; they could implement the ->map_pages
vm_operation, but if they do, they would be the first. This is probably
something we want to change at some point in the future, and I've marked
where to make that change in the code.
There is very little performance change in the benchmarks we've run;
mostly because the vast majority of page faults are handled through the
other paths. I still think this patch series is useful for workloads that
may take these paths more often, and just for cleaning up the fault path
in general (it's now clearer why we have to retry in these cases).
This patch (of 6):
Drop the VMA lock instead of the mmap_lock if that's the one which
is held.
Link: https://lkml.kernel.org/r/20231006195318.4087158-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231006195318.4087158-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 20:53:13 +01:00
|
|
|
* the fault lock still held. That's how FAULT_FLAG_RETRY_NOWAIT
|
2019-03-15 11:26:07 -07:00
|
|
|
* is supposed to work. We have way too many special cases..
|
|
|
|
*/
|
2019-03-13 11:44:22 -07:00
|
|
|
if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
*fpin = maybe_unlock_mmap_for_io(vmf, *fpin);
|
|
|
|
if (vmf->flags & FAULT_FLAG_KILLABLE) {
|
2020-12-08 00:07:31 -05:00
|
|
|
if (__folio_lock_killable(folio)) {
|
2019-03-13 11:44:22 -07:00
|
|
|
/*
|
2023-10-06 20:53:13 +01:00
|
|
|
* We didn't have the right flags to drop the
|
|
|
|
* fault lock, but all fault_handlers only check
|
|
|
|
* for fatal signals if we return VM_FAULT_RETRY,
|
|
|
|
* so we need to drop the fault lock here and
|
|
|
|
* return 0 if we don't have a fpin.
|
2019-03-13 11:44:22 -07:00
|
|
|
*/
|
|
|
|
if (*fpin == NULL)
|
2023-10-06 20:53:13 +01:00
|
|
|
release_fault_lock(vmf);
|
2019-03-13 11:44:22 -07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
} else
|
2021-03-01 19:38:25 -05:00
|
|
|
__folio_lock(folio);
|
|
|
|
|
2019-03-13 11:44:22 -07:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2009-06-16 15:31:25 -07:00
|
|
|
/*
|
2019-03-13 11:44:22 -07:00
|
|
|
* Synchronous readahead happens when we don't even find a page in the page
|
|
|
|
* cache at all. We don't want to perform IO under the mmap sem, so if we have
|
|
|
|
* to drop the mmap sem we return the file that was pinned in order for us to do
|
|
|
|
* that. If we didn't pin a file then we return NULL. The file that is
|
|
|
|
* returned needs to be fput()'ed when we're done with it.
|
2009-06-16 15:31:25 -07:00
|
|
|
*/
|
2019-03-13 11:44:22 -07:00
|
|
|
static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
|
2009-06-16 15:31:25 -07:00
|
|
|
{
|
2019-03-13 11:44:18 -07:00
|
|
|
struct file *file = vmf->vma->vm_file;
|
|
|
|
struct file_ra_state *ra = &file->f_ra;
|
2009-06-16 15:31:25 -07:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
2021-04-07 21:18:55 +01:00
|
|
|
DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
|
2019-03-13 11:44:22 -07:00
|
|
|
struct file *fpin = NULL;
|
2022-05-25 14:23:45 -04:00
|
|
|
unsigned long vm_flags = vmf->vma->vm_flags;
|
2020-08-14 17:31:27 -07:00
|
|
|
unsigned int mmap_miss;
|
2009-06-16 15:31:25 -07:00
|
|
|
|
2021-07-24 23:37:13 -04:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
/* Use the readahead code, even if readahead is disabled */
|
2024-06-27 10:39:51 +10:00
|
|
|
if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
|
2021-07-24 23:37:13 -04:00
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
|
|
|
ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
|
|
|
|
ra->size = HPAGE_PMD_NR;
|
|
|
|
/*
|
|
|
|
* Fetch two PMD folios, so we get the chance to actually
|
|
|
|
* readahead, unless we've been told not to.
|
|
|
|
*/
|
2022-05-25 14:23:45 -04:00
|
|
|
if (!(vm_flags & VM_RAND_READ))
|
2021-07-24 23:37:13 -04:00
|
|
|
ra->size *= 2;
|
|
|
|
ra->async_size = HPAGE_PMD_NR;
|
|
|
|
page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
|
|
|
|
return fpin;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2009-06-16 15:31:25 -07:00
|
|
|
/* If we don't want any read-ahead, don't bother */
|
2022-05-25 14:23:45 -04:00
|
|
|
if (vm_flags & VM_RAND_READ)
|
2019-03-13 11:44:22 -07:00
|
|
|
return fpin;
|
2011-05-24 17:12:28 -07:00
|
|
|
if (!ra->ra_pages)
|
2019-03-13 11:44:22 -07:00
|
|
|
return fpin;
|
2009-06-16 15:31:25 -07:00
|
|
|
|
2022-05-25 14:23:45 -04:00
|
|
|
if (vm_flags & VM_SEQ_READ) {
|
2019-03-13 11:44:22 -07:00
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
2021-04-07 21:18:55 +01:00
|
|
|
page_cache_sync_ra(&ractl, ra->ra_pages);
|
2019-03-13 11:44:22 -07:00
|
|
|
return fpin;
|
2009-06-16 15:31:25 -07:00
|
|
|
}
|
|
|
|
|
2011-05-24 17:12:29 -07:00
|
|
|
/* Avoid banging the cache line if not needed */
|
2020-08-14 17:31:27 -07:00
|
|
|
mmap_miss = READ_ONCE(ra->mmap_miss);
|
|
|
|
if (mmap_miss < MMAP_LOTSAMISS * 10)
|
|
|
|
WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
|
2009-06-16 15:31:25 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Do we miss much more than hit in this file? If so,
|
|
|
|
* stop bothering with read-ahead. It will only hurt.
|
|
|
|
*/
|
2020-08-14 17:31:27 -07:00
|
|
|
if (mmap_miss > MMAP_LOTSAMISS)
|
2019-03-13 11:44:22 -07:00
|
|
|
return fpin;
|
2009-06-16 15:31:25 -07:00
|
|
|
|
2009-06-16 15:31:30 -07:00
|
|
|
/*
|
|
|
|
* mmap read-around
|
|
|
|
*/
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of IO
pressure on the system when the process is in a lower-priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in its
core web server, because some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trips over processes' mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead, rework the filemap fault path to drop the mmap_sem at any point where
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bios if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 11:44:22 -07:00
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
2020-10-15 20:06:31 -07:00
|
|
|
ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
|
2015-11-05 18:47:08 -08:00
|
|
|
ra->size = ra->ra_pages;
|
|
|
|
ra->async_size = ra->ra_pages / 4;
|
2020-10-15 20:06:31 -07:00
|
|
|
ractl._index = ra->start;
|
2021-07-24 23:26:14 -04:00
|
|
|
page_cache_ra_order(&ractl, ra, 0);
|
2019-03-13 11:44:22 -07:00
|
|
|
return fpin;
|
2009-06-16 15:31:25 -07:00
|
|
|
}
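The read-around window set up just above (start, size, async_size) can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: `struct ra_window` and `mmap_read_around()` are hypothetical names that only mirror the three assignments to `ra->start`, `ra->size` and `ra->async_size`.

```c
#include <stdio.h>

/* Hypothetical stand-in for the three file_ra_state fields set above. */
struct ra_window {
	unsigned long start;      /* first page index of the window */
	unsigned int size;        /* number of pages in the window */
	unsigned int async_size;  /* trailing pages that trigger async readahead */
};

/*
 * Centre a window of ra_pages pages on the faulting page offset,
 * clamping the start at page 0 (the role of max_t(long, 0, ...)).
 */
static struct ra_window mmap_read_around(unsigned long pgoff,
					 unsigned int ra_pages)
{
	struct ra_window w;
	long start = (long)pgoff - (long)(ra_pages / 2);

	w.start = start > 0 ? (unsigned long)start : 0;
	w.size = ra_pages;
	w.async_size = ra_pages / 4;
	return w;
}

/* Print a window; purely for illustration. */
static void print_window(const struct ra_window *w)
{
	printf("start=%lu size=%u async_size=%u\n",
	       w->start, w->size, w->async_size);
}
```

With `ra_pages` of 32, a fault at page 100 would give a window starting at page 84, while a fault at page 3 would be clamped to start at page 0.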
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Asynchronous readahead happens when we find the page and PG_readahead,
|
2019-03-13 11:44:22 -07:00
|
|
|
* so we want to possibly extend the readahead further. We return the file that
|
2020-06-08 21:33:54 -07:00
|
|
|
* was pinned if we have to drop the mmap_lock in order to do IO.
|
2009-06-16 15:31:25 -07:00
|
|
|
*/
|
2019-03-13 11:44:22 -07:00
|
|
|
static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
|
2021-07-29 14:57:01 -04:00
|
|
|
struct folio *folio)
|
2009-06-16 15:31:25 -07:00
|
|
|
{
|
2019-03-13 11:44:18 -07:00
|
|
|
struct file *file = vmf->vma->vm_file;
|
|
|
|
struct file_ra_state *ra = &file->f_ra;
|
2021-07-29 14:57:01 -04:00
|
|
|
DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
|
2019-03-13 11:44:22 -07:00
|
|
|
struct file *fpin = NULL;
|
2020-08-14 17:31:27 -07:00
|
|
|
unsigned int mmap_miss;
|
2009-06-16 15:31:25 -07:00
|
|
|
|
|
|
|
/* If we don't want any read-ahead, don't bother */
|
2020-04-01 21:04:40 -07:00
|
|
|
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
|
2019-03-13 11:44:22 -07:00
|
|
|
return fpin;
|
2021-07-29 14:57:01 -04:00
|
|
|
|
2020-08-14 17:31:27 -07:00
|
|
|
mmap_miss = READ_ONCE(ra->mmap_miss);
|
|
|
|
if (mmap_miss)
|
|
|
|
WRITE_ONCE(ra->mmap_miss, --mmap_miss);
|
2021-07-29 14:57:01 -04:00
|
|
|
|
|
|
|
if (folio_test_readahead(folio)) {
|
2019-03-13 11:44:22 -07:00
|
|
|
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
|
2021-07-29 14:57:01 -04:00
|
|
|
page_cache_async_ra(&ractl, folio, ra->ra_pages);
|
2019-03-13 11:44:22 -07:00
|
|
|
}
|
|
|
|
return fpin;
|
2009-06-16 15:31:25 -07:00
|
|
|
}
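The mmap_miss handling in do_async_mmap_readahead() is a saturating decrement done with a plain load and a plain store rather than an atomic read-modify-write. Below is a rough userspace analogue, assuming C11 relaxed atomics as a stand-in for READ_ONCE()/WRITE_ONCE(); `struct miss_counter` and `note_mmap_hit()` are hypothetical names, not kernel API.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the mmap_miss field of file_ra_state. */
struct miss_counter {
	_Atomic unsigned int mmap_miss;
};

/*
 * Lower the miss count by one on a page-cache hit, never going below
 * zero. The load and the store are separate relaxed operations, so a
 * concurrent update can be lost; that is tolerable for a heuristic.
 */
static unsigned int note_mmap_hit(struct miss_counter *c)
{
	unsigned int miss = atomic_load_explicit(&c->mmap_miss,
						 memory_order_relaxed);

	if (miss)
		atomic_store_explicit(&c->mmap_miss, --miss,
				      memory_order_relaxed);
	return miss;
}
```

The design trades exactness for cheapness: the counter only steers a readahead heuristic, so an occasional lost decrement does no harm.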
|
|
|
|
|
filemap: avoid unnecessary major faults in filemap_fault()
A major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) in an
application, which led to an unexpected issue[1].
This is caused by a temporarily cleared PTE during a read+clear/modify/write
update of the PTE, e.g. in do_numa_page()/change_pte_range().
For the data segment of the user-mode program, the global variable area is
a private mapping. After the pagecache is loaded, the private anonymous
page is generated after the COW is triggered. Mlockall can lock COW pages
(anonymous pages), but the original file pages cannot be locked and may be
reclaimed. If the global variable (private anon page) is accessed when
vmf->pte is zeroed in numa fault, a file page fault will be triggered. At
this time, the original private file page may have been reclaimed. If the
page cache is not available at this time, a major fault will be triggered
and the file will be read, causing additional overhead.
This issue affects our traffic analysis service. The inbound traffic is
heavy. If a major fault occurs, the I/O schedule is triggered and the
original I/O is suspended. Generally, the I/O schedule is 0.7 ms. If
other applications are operating disks, the system needs to wait for more
than 10 ms. However, the inbound traffic is heavy and the NIC buffer is
small. As a result, packet loss occurs. But the traffic analysis service
can't tolerate packet loss.
Fix this by holding PTL and rechecking the PTE in filemap_fault() before
triggering a major fault. We do this check only if vma is VM_LOCKED to
reduce the performance impact in common scenarios.
In our product environment, there were 7 major faults every 12 hours.
After the patch is applied, no major faults have been triggered.
We tested file page read and write page fault performance on ext4 and
ramdisk using will-it-scale[2] on an x86 physical machine. The data is the
average change compared with the mainline after the patch is applied. The
test results are within the range of fluctuation. We do this check only
if vma is VM_LOCKED, therefore no performance regression is caused for
most common cases.
The test results are as follows:
                             processes  processes_idle  threads  threads_idle
ext4 private file write:         0.22%           0.26%    1.21%        -0.15%
ext4 private file read:          0.03%           1.00%    1.39%         0.34%
ext4 shared file write:         -0.50%          -0.02%   -0.14%        -0.02%
ramdisk private file write:      0.07%           0.02%    0.53%         0.04%
ramdisk private file read:       0.01%           1.60%   -0.32%        -0.02%
[1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
[2] https://github.com/antonblanchard/will-it-scale/
Link: https://lkml.kernel.org/r/20240306083809.1236634-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 16:38:09 +08:00
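The scenario in the commit message above (an application calling mlockall(MCL_CURRENT | MCL_FUTURE) and still seeing major faults) can be observed from userspace. This is a minimal sketch using the standard mlockall(2) and getrusage(2) calls; the helper names `major_faults()` and `lock_all_memory()` are hypothetical, and the sketch does not reproduce the race itself.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Return the number of major faults taken by this process so far, or -1. */
static long major_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_majflt;
}

/*
 * Lock all current and future mappings into memory.
 * Returns 0 on success and -1 on failure; failure is common when
 * RLIMIT_MEMLOCK is low and the process is unprivileged.
 */
static int lock_all_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return -1;
	}
	return 0;
}
```

A monitoring loop could call `lock_all_memory()` once at startup and then sample `major_faults()` periodically; any increase after locking would point at the kind of unexpected major fault described above.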
|
|
|
static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
|
|
|
|
{
|
|
|
|
struct vm_area_struct *vma = vmf->vma;
|
|
|
|
vm_fault_t ret = 0;
|
|
|
|
pte_t *ptep;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We might have COW'ed a pagecache folio and might now have an mlocked
|
|
|
|
* anon folio mapped. The original pagecache folio is not mlocked and
|
|
|
|
* might have been evicted. During a read+clear/modify/write update of
|
|
|
|
* the PTE, such as done in do_numa_page()/change_pte_range(), we
|
|
|
|
* temporarily clear the PTE under PT lock and might detect it here as
|
|
|
|
* "none" when not holding the PT lock.
|
|
|
|
*
|
|
|
|
* Not rechecking the PTE under PT lock could result in an unexpected
|
|
|
|
* major fault in an mlock'ed region. Recheck only for this special
|
|
|
|
* scenario while holding the PT lock, to not degrade non-mlocked
|
|
|
|
* scenarios. Recheck the PTE without PT lock first, thereby reducing
|
|
|
|
* the number of times we hold PT lock.
|
|
|
|
*/
|
|
|
|
if (!(vma->vm_flags & VM_LOCKED))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
|
|
|
|
return 0;
|
|
|
|
|
2024-03-13 09:29:13 +08:00
|
|
|
ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
|
|
|
|
&vmf->ptl);
|
filemap: avoid unnecessary major faults in filemap_fault()
A major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) in an
application, which led to an unexpected issue[1].
This is caused by a temporarily cleared PTE during a read+clear/modify/write
update of the PTE, e.g. in do_numa_page()/change_pte_range().
For the data segment of a user-mode program, the global variable area is
a private mapping. After the pagecache is loaded, a private anonymous
page is generated once COW is triggered. mlockall() can lock COW pages
(anonymous pages), but the original file pages cannot be locked and may be
reclaimed. If the global variable (private anon page) is accessed while
vmf->pte is zeroed in a NUMA fault, a file page fault will be triggered. At
this time, the original private file page may have been reclaimed. If the
page cache is not available at this time, a major fault will be triggered
and the file will be read, causing additional overhead.
This issue affects our traffic analysis service. The inbound traffic is
heavy. If a major fault occurs, the I/O schedule is triggered and the
original I/O is suspended. Generally, the I/O schedule is 0.7 ms. If
other applications are operating disks, the system needs to wait for more
than 10 ms. However, the inbound traffic is heavy and the NIC buffer is
small. As a result, packet loss occurs. But the traffic analysis service
can't tolerate packet loss.
Fix this by holding PTL and rechecking the PTE in filemap_fault() before
triggering a major fault. We do this check only if vma is VM_LOCKED to
reduce the performance impact in common scenarios.
In our product environment, there were 7 major faults every 12 hours.
After the patch was applied, no major faults have been triggered.
We tested file page read and write page fault performance on ext4 and
ramdisk using will-it-scale[2] on an x86 physical machine. The data is the
average change compared with the mainline after the patch is applied. The
test results are within the range of fluctuation. We do this check only
if the vma is VM_LOCKED; therefore, no performance regression is caused in
the most common cases.
The test results are as follows:
                             processes  processes_idle  threads  threads_idle
ext4 private file write:         0.22%           0.26%    1.21%        -0.15%
ext4 private file read:          0.03%           1.00%    1.39%         0.34%
ext4 shared file write:         -0.50%          -0.02%   -0.14%        -0.02%
ramdisk private file write:      0.07%           0.02%    0.53%         0.04%
ramdisk private file read:       0.01%           1.60%   -0.32%        -0.02%
[1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
[2] https://github.com/antonblanchard/will-it-scale/
Link: https://lkml.kernel.org/r/20240306083809.1236634-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 16:38:09 +08:00
|
|
|
if (unlikely(!ptep))
|
|
|
|
return VM_FAULT_NOPAGE;
|
|
|
|
|
|
|
|
if (unlikely(!pte_none(ptep_get_lockless(ptep)))) {
|
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
} else {
|
|
|
|
spin_lock(vmf->ptl);
|
|
|
|
if (unlikely(!pte_none(ptep_get(ptep))))
|
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
spin_unlock(vmf->ptl);
|
|
|
|
}
|
|
|
|
pte_unmap(ptep);
|
|
|
|
return ret;
|
|
|
|
}
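The check-then-recheck shape of filemap_fault_recheck_pte_none() can be illustrated outside the kernel. What follows is a minimal userspace sketch, not kernel code. The names `struct slot`, `slot_init()`, `slot_update()` and `slot_recheck_populated()` are hypothetical, a C11 `atomic_flag` spinlock stands in for the PT lock, and a zero value stands in for pte_none(). It only shows why an "empty" lockless read is confirmed under the lock before the expensive path is taken.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one page-table entry; 0 plays the role of pte_none(). */
struct slot {
	atomic_flag lock;		/* stands in for the PT lock */
	_Atomic unsigned long val;	/* stands in for the PTE value */
};

static void slot_init(struct slot *s)
{
	atomic_flag_clear(&s->lock);
	atomic_init(&s->val, 0);
}

static void slot_lock(struct slot *s)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;
}

static void slot_unlock(struct slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/*
 * Read+clear/modify/write update: the value is transiently 0 while the
 * lock is held, which is what a lockless reader may observe.
 */
static void slot_update(struct slot *s, unsigned long newval)
{
	slot_lock(s);
	atomic_store_explicit(&s->val, 0, memory_order_relaxed);
	atomic_store_explicit(&s->val, newval, memory_order_relaxed);
	slot_unlock(s);
}

/*
 * Returns true if the slot is populated, so the caller can skip the
 * expensive path. The cheap lockless read comes first; only an "empty"
 * result is confirmed under the lock, because an updater may have
 * cleared the value temporarily.
 */
static bool slot_recheck_populated(struct slot *s)
{
	bool populated;

	if (atomic_load_explicit(&s->val, memory_order_relaxed) != 0)
		return true;

	slot_lock(s);
	populated = atomic_load_explicit(&s->val, memory_order_relaxed) != 0;
	slot_unlock(s);
	return populated;
}
```

In this sketch a caller would treat a true result the way the kernel function treats VM_FAULT_NOPAGE: return without doing I/O and let the access be retried. Taking the lock only after the lockless read reports "empty" keeps the common populated case free of lock traffic.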
|
|
|
|
|
2006-06-23 02:03:49 -07:00
|
|
|
/**
|
2007-07-19 01:46:59 -07:00
|
|
|
* filemap_fault - read in file data for page fault handling
|
2007-07-19 01:47:03 -07:00
|
|
|
* @vmf: struct vm_fault containing details of the fault
|
2006-06-23 02:03:49 -07:00
|
|
|
*
|
2007-07-19 01:46:59 -07:00
|
|
|
* filemap_fault() is invoked via the vma operations vector for a
|
2005-04-16 15:20:36 -07:00
|
|
|
* mapped memory region to read in file data during a page fault.
|
|
|
|
*
|
|
|
|
* The goto's are kind of ugly, but this streamlines the normal case of having
|
|
|
|
* it in the page cache, and handles the special cases reasonably without
|
|
|
|
* having a lot of duplicated code.
|
2014-08-06 16:07:24 -07:00
|
|
|
*
|
2020-06-08 21:33:54 -07:00
|
|
|
* vma->vm_mm->mmap_lock must be held on entry.
|
2014-08-06 16:07:24 -07:00
|
|
|
*
|
2020-06-08 21:33:54 -07:00
|
|
|
* If our return value has VM_FAULT_RETRY set, it's because the mmap_lock
|
2021-03-10 10:46:41 -05:00
|
|
|
* may be dropped before doing I/O or by lock_folio_maybe_drop_mmap().
|
2014-08-06 16:07:24 -07:00
|
|
|
*
|
2020-06-08 21:33:54 -07:00
|
|
|
* If our return value does not have VM_FAULT_RETRY set, the mmap_lock
|
2014-08-06 16:07:24 -07:00
|
|
|
* has not been released.
|
|
|
|
*
|
|
|
|
* We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: bitwise-OR of %VM_FAULT_ codes.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2018-06-07 17:08:00 -07:00
|
|
|
vm_fault_t filemap_fault(struct vm_fault *vmf)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
int error;
|
2017-02-24 14:56:41 -08:00
|
|
|
struct file *file = vmf->vma->vm_file;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of IO
pressure on the system while the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in its
core web server, because some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes' mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead, rework the filemap fault path to drop the mmap_sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently because, generally speaking, if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However, a forthcoming patchset will change this
with the ability to abort readahead bios if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 11:44:22 -07:00
|
|
|
struct file *fpin = NULL;
|
2005-04-16 15:20:36 -07:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
struct inode *inode = mapping->host;
|
2021-03-10 10:46:41 -05:00
|
|
|
pgoff_t max_idx, index = vmf->pgoff;
|
|
|
|
struct folio *folio;
|
2018-06-07 17:08:00 -07:00
|
|
|
vm_fault_t ret = 0;
|
2021-01-28 19:19:45 +01:00
|
|
|
bool mapping_locked = false;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2021-03-10 10:46:41 -05:00
|
|
|
max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
|
|
|
|
if (unlikely(index >= max_idx))
|
2007-10-31 09:19:46 -07:00
|
|
|
return VM_FAULT_SIGBUS;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
filemap: add trace events for get_pages, map_pages, and fault
To allow precise tracking of page caches accessed, add new tracepoints
that trigger when a process actually accesses them.
The ureadahead program used by ChromeOS traces the disk access of programs
as they start up at boot up. It uses mincore(2) or the
'mm_filemap_add_to_page_cache' trace event to accomplish this. It stores
this information in a "pack" file and on subsequent boots, it will read
the pack file and call readahead(2) on the information so that disk
storage can be loaded into RAM before the applications actually need it.
A problem we see is that the kernel's readahead algorithm can
aggressively pull in more data than needed (to try to accomplish the same
goal), and this data is also recorded. The end result is that the pack
file contains a lot of pages on disk that are never actually used.
Calling readahead(2) on these unused pages can slow down the system boot
up times.
To solve this, add 3 new trace events, get_pages, map_pages, and fault.
These will be used to trace the pages that are not only pulled in from disk,
but are actually used by the application. Only those pages will be stored
in the pack file, and this helps out the performance of boot up.
With the combination of these 3 new trace events and
mm_filemap_add_to_page_cache, we observed a reduction in the pack file by
7.3% - 20% on ChromeOS varying by device.
Link: https://lkml.kernel.org/r/20240813100312.3930505-1-takayas@chromium.org
Signed-off-by: Takaya Saeki <takayas@chromium.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Junichi Uekawa <uekawa@chromium.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-13 10:03:12 +00:00
|
|
|
trace_mm_filemap_fault(mapping, index);
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
2013-10-16 13:46:59 -07:00
|
|
|
* Do we have something in the page cache already?
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2021-03-10 10:46:41 -05:00
|
|
|
folio = filemap_get_folio(mapping, index);
|
2023-03-07 15:34:10 +01:00
|
|
|
if (likely(!IS_ERR(folio))) {
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
2021-01-28 19:19:45 +01:00
|
|
|
* We found the page, so try async readahead before waiting for
|
|
|
|
* the lock.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2021-01-28 19:19:45 +01:00
|
|
|
if (!(vmf->flags & FAULT_FLAG_TRIED))
|
2021-07-29 14:57:01 -04:00
|
|
|
fpin = do_async_mmap_readahead(vmf, folio);
|
2021-03-10 10:46:41 -05:00
|
|
|
if (unlikely(!folio_test_uptodate(folio))) {
|
2021-01-28 19:19:45 +01:00
|
|
|
filemap_invalidate_lock_shared(mapping);
|
|
|
|
mapping_locked = true;
|
|
|
|
}
|
|
|
|
} else {
|
2024-03-06 16:38:09 +08:00
|
|
|
ret = filemap_fault_recheck_pte_none(vmf);
|
|
|
|
if (unlikely(ret))
|
|
|
|
return ret;
|
|
|
|
|
2009-06-16 15:31:25 -07:00
|
|
|
/* No page in the page cache at all */
|
|
|
|
count_vm_event(PGMAJFAULT);
|
2017-07-06 15:40:25 -07:00
|
|
|
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
|
2009-06-16 15:31:25 -07:00
|
|
|
ret = VM_FAULT_MAJOR;
|
2019-03-13 11:44:22 -07:00
|
|
|
fpin = do_sync_mmap_readahead(vmf);
|
2009-06-16 15:31:25 -07:00
|
|
|
retry_find:
|
2021-01-28 19:19:45 +01:00
|
|
|
/*
|
2021-03-10 10:46:41 -05:00
|
|
|
* See comment in filemap_create_folio() why we need
|
2021-01-28 19:19:45 +01:00
|
|
|
* invalidate_lock
|
|
|
|
*/
|
|
|
|
if (!mapping_locked) {
|
|
|
|
filemap_invalidate_lock_shared(mapping);
|
|
|
|
mapping_locked = true;
|
|
|
|
}
|
2021-03-10 10:46:41 -05:00
|
|
|
folio = __filemap_get_folio(mapping, index,
|
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be admitted, to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still set up properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return an unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 11:44:14 -07:00
|
|
|
FGP_CREAT|FGP_FOR_MMAP,
|
|
|
|
vmf->gfp_mask);
|
2023-03-07 15:34:10 +01:00
|
|
|
if (IS_ERR(folio)) {
|
2019-03-13 11:44:22 -07:00
|
|
|
if (fpin)
|
|
|
|
goto out_retry;
|
2021-01-28 19:19:45 +01:00
|
|
|
filemap_invalidate_unlock_shared(mapping);
|
2020-04-01 21:04:53 -07:00
|
|
|
return VM_FAULT_OOM;
|
2019-03-13 11:44:22 -07:00
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2021-03-10 10:46:41 -05:00
|
|
|
if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
|
2019-03-13 11:44:22 -07:00
		goto out_retry;
2010-10-26 14:21:56 -07:00
	/* Did it get truncated? */
2021-03-10 10:46:41 -05:00
	if (unlikely(folio->mapping != mapping)) {
		folio_unlock(folio);
		folio_put(folio);
2010-10-26 14:21:56 -07:00
		goto retry_find;
	}
2021-03-10 10:46:41 -05:00
	VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
2010-10-26 14:21:56 -07:00
|
|
|
|
2005-04-16 15:20:36 -07:00
	/*
mm/filemap: clarify filemap_fault() comments for not uptodate case
The existing comments in filemap_fault() suggest that, after either a
minor fault has occurred and filemap_get_folio() found a folio in the page
cache, or a major fault arose and __filemap_get_folio(FGP_CREATE...) did
the job (having relied on do_sync_mmap_readahead() or filemap_read_folio()
to read in the folio), the only possible reason it could not be uptodate
is because of an error.
This is not so, as if, for instance, the fault occurred within a VMA which
had the VM_RAND_READ flag set (via madvise() with the MADV_RANDOM flag
specified), this would cause even synchronous readahead to fail to read in
the folio.
I confirmed this by dropping page caches and faulting in memory
madvise()'d this way, observing that this code path was reached on each
occasion.
Clarify the comments to include this case, and additionally update the
comment recently added around the invalidate lock logic to make it clear
the comment explicitly refers to the minor fault case.
In addition, while we're here, refer to folios rather than pages.
[lstoakes@gmail.com: correct indentation as per Christopher's feedback]
Link: https://lkml.kernel.org/r/2c7014c0-6343-4e76-8697-3f84f54350bd@lucifer.local
Link: https://lkml.kernel.org/r/20230930231029.88196-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
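A minimal userspace sketch of the kind of experiment described in the
message above, under stated assumptions. The helper name
touch_random_mapping() and the /tmp scratch path are hypothetical, and
posix_fadvise(POSIX_FADV_DONTNEED) is only a best-effort stand-in for
dropping page caches. Whether the not-uptodate path in filemap_fault() is
actually reached depends on the filesystem backing the file (a tmpfs /tmp
does not use filemap_fault() at all), so treat this as an illustration of
the setup rather than a reproducer.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the kernel tree: create a temporary
 * file of npages pages, map it, mark the mapping MADV_RANDOM and read
 * one byte from every page. Returns the number of pages whose marker
 * byte was read back, or -1 on failure.
 */
static long touch_random_mapping(long npages)
{
	char path[] = "/tmp/madv-random-XXXXXX";	/* assumed scratch location */
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *map;
	size_t len;
	long i, touched = 0;
	int fd;

	if (npages <= 0 || page_size <= 0)
		return -1;
	len = (size_t)npages * (size_t)page_size;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the file goes away once fd is closed */

	/* Size the file and put one marker byte at the start of each page. */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	for (i = 0; i < npages; i++) {
		if (pwrite(fd, "x", 1, (off_t)(i * page_size)) != 1) {
			close(fd);
			return -1;
		}
	}

	/* Best effort only: write back, then hint that the cache may be dropped. */
	fsync(fd);
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* MADV_RANDOM is the madvise() flag that sets VM_RAND_READ on the VMA. */
	if (madvise(map, len, MADV_RANDOM) == 0) {
		for (i = 0; i < npages; i++)
			if (*(volatile unsigned char *)(map + i * page_size) == 'x')
				touched++;
	} else {
		touched = -1;
	}

	munmap(map, len);
	close(fd);
	return touched;
}
```

Calling touch_random_mapping(4) faults in four file-backed pages one at a
time. Observing which kernel path each fault takes would need separate
tracing, which this sketch does not attempt.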
2023-10-01 00:10:29 +01:00
	 * We have a locked folio in the page cache, now we need to check
	 * that it's up-to-date. If not, it is going to be due to an error,
	 * or because readahead was otherwise unable to retrieve it.
2005-04-16 15:20:36 -07:00
	 */
2021-03-10 10:46:41 -05:00
	if (unlikely(!folio_test_uptodate(folio))) {
2021-01-28 19:19:45 +01:00
		/*
2023-10-01 00:10:29 +01:00
		 * If the invalidate lock is not held, the folio was in cache
		 * and uptodate and now it is not. Strange but possible since we
		 * didn't hold the page lock all the time. Let's drop
		 * everything, get the invalidate lock and try again.
2021-01-28 19:19:45 +01:00
		 */
		if (!mapping_locked) {
2021-03-10 10:46:41 -05:00
			folio_unlock(folio);
			folio_put(folio);
2021-01-28 19:19:45 +01:00
			goto retry_find;
		}
2023-10-01 00:10:29 +01:00
		/*
		 * OK, the folio is really not uptodate. This can be because the
		 * VMA has the VM_RAND_READ flag set, or because an error
		 * arose. Let's read it in directly.
		 */
2005-04-16 15:20:36 -07:00
		goto page_not_uptodate;
2021-01-28 19:19:45 +01:00
	}
2005-04-16 15:20:36 -07:00
|
|
|
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in its
core web server, because some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes' mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
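The protocol described in the message above (drop the lock before any
blocking work, finish the I/O, then ask the caller to retry) can be
sketched as a small single-threaded toy model. Everything here is
hypothetical and exists only for illustration: struct toy_mm, the TOY_*
return codes and the helper names are not kernel interfaces, and a plain
boolean stands in for mmap_sem.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_fault { TOY_FAULT_DONE, TOY_FAULT_RETRY };

struct toy_mm {
	bool mmap_lock_held;	/* stands in for mmap_sem */
	bool page_uptodate;	/* stands in for the page cache state */
	int io_under_lock;	/* blocking reads issued with the lock held */
};

/* Pretend to read the page in; record whether the lock was still held. */
static void toy_read_page(struct toy_mm *mm)
{
	if (mm->mmap_lock_held)
		mm->io_under_lock++;
	mm->page_uptodate = true;
}

/* Toy fault handler: never performs the read while the lock is held. */
static enum toy_fault toy_filemap_fault(struct toy_mm *mm)
{
	assert(mm->mmap_lock_held);	/* callers enter with the lock held */

	if (mm->page_uptodate)
		return TOY_FAULT_DONE;	/* in cache: finish under the lock */

	mm->mmap_lock_held = false;	/* drop the lock before blocking */
	toy_read_page(mm);
	return TOY_FAULT_RETRY;		/* caller retakes the lock and retries */
}

/* Toy caller: retake the lock and redo the fault until it completes. */
static int toy_handle_fault(struct toy_mm *mm)
{
	enum toy_fault ret;
	int attempts = 0;

	do {
		mm->mmap_lock_held = true;
		ret = toy_filemap_fault(mm);
		attempts++;
	} while (ret == TOY_FAULT_RETRY);

	mm->mmap_lock_held = false;
	return attempts;
}
```

In this model a cold fault takes two passes (one that drops the lock and
reads, one that finds the page in cache) and io_under_lock stays at zero,
which is the property the commit is after. The real fault path has more
states and limits on retrying that the toy ignores.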
2019-03-13 11:44:22 -07:00
	/*
2020-06-08 21:33:54 -07:00
	 * We've made it this far and we had to drop our mmap_lock, now is the
2019-03-13 11:44:22 -07:00
	 * time to return to the upper layer and have it re-find the vma and
	 * redo the fault.
	 */
	if (fpin) {
2021-03-10 10:46:41 -05:00
		folio_unlock(folio);
2019-03-13 11:44:22 -07:00
		goto out_retry;
	}
2021-01-28 19:19:45 +01:00
	if (mapping_locked)
		filemap_invalidate_unlock_shared(mapping);
2019-03-13 11:44:22 -07:00
|
|
|
|
2009-06-16 15:31:25 -07:00
	/*
	 * Found the page and have a reference on it.
	 * We must recheck i_size under page lock.
	 */
2021-03-10 10:46:41 -05:00
	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	if (unlikely(index >= max_idx)) {
		folio_unlock(folio);
		folio_put(folio);
2007-10-31 09:19:46 -07:00
		return VM_FAULT_SIGBUS;
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
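The approach chosen in the message above (take the page lock in the fault
path, then recheck that the page still belongs to the mapping) can be
illustrated with a single-threaded toy model. All names here are
hypothetical and only mimic the shape of the check. The boolean "locked"
flag is not a real lock, so the model shows the ordering of the steps but
not actual mutual exclusion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mapping { int id; };

struct toy_page {
	struct toy_mapping *mapping;	/* NULL once removed from the cache */
	bool locked;
};

static void toy_lock_page(struct toy_page *page)
{
	assert(!page->locked);	/* single-threaded model: never contended */
	page->locked = true;
}

static void toy_unlock_page(struct toy_page *page)
{
	assert(page->locked);
	page->locked = false;
}

/* Invalidation side: detach the page only while holding its lock. */
static void toy_invalidate(struct toy_page *page)
{
	toy_lock_page(page);
	page->mapping = NULL;
	toy_unlock_page(page);
}

/*
 * Fault side: lock the page, then recheck that it was not invalidated
 * in the meantime. Returns the page still locked on success, or NULL
 * (unlocked) when the caller has to look the page up again.
 */
static struct toy_page *toy_fault_lock(struct toy_page *page,
				       struct toy_mapping *mapping)
{
	toy_lock_page(page);
	if (page->mapping != mapping) {
		toy_unlock_page(page);
		return NULL;
	}
	return page;
}
```

Returning the page locked mirrors the idea that the caller only unlocks
after the mapping is fully set up, so invalidation cannot slip in between
the check and the use.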
2007-07-19 01:46:57 -07:00
	}
2021-03-10 10:46:41 -05:00
	vmf->page = folio_file_page(folio, index);
2007-07-19 01:47:05 -07:00
	return ret | VM_FAULT_LOCKED;
2005-04-16 15:20:36 -07:00
page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
2019-03-13 11:44:22 -07:00
	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
2022-05-12 17:37:01 -04:00
	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
2019-03-13 11:44:22 -07:00
	if (fpin)
		goto out_retry;
2021-03-10 10:46:41 -05:00
	folio_put(folio);
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
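The truncate_count protocol described above can be sketched in user space. This is a single-threaded illustration under stated assumptions, not the kernel code: `struct toy_file`, `toy_truncate()`, `toy_fault()` and the `race` hook are hypothetical names, and the hook merely stands in for a truncation that lands between the page lookup and the recheck.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-file state named in the commit message. */
struct toy_file {
	unsigned long i_size;		/* file size, in pages for simplicity */
	unsigned long truncate_count;	/* bumped by every truncation */
};

enum toy_result { TOY_MAPPED, TOY_BEYOND_EOF };

/* Truncation side: shrink i_size first, then bump the counter. */
static void toy_truncate(struct toy_file *f, unsigned long new_size)
{
	f->i_size = new_size;
	f->truncate_count++;
	/* Unmapping of userspace pages would follow here. */
}

/* Example race hook: a truncation to zero that lands mid-fault. */
static void toy_race_truncate_to_zero(struct toy_file *f)
{
	toy_truncate(f, 0);
}

/*
 * Fault side: sample the counter, look the page up if it is within
 * i_size, then recheck the counter and retry if it moved. In the kernel
 * the recheck happened under the page table lock; here the optional
 * race hook simulates a concurrent truncation, and *attempts counts
 * how many passes were needed.
 */
static enum toy_result toy_fault(struct toy_file *f, unsigned long pgoff,
				 void (*race)(struct toy_file *), int *attempts)
{
	assert(f != NULL && attempts != NULL);

	for (;;) {
		unsigned long seq = f->truncate_count;

		(*attempts)++;
		if (pgoff >= f->i_size)
			return TOY_BEYOND_EOF;

		/* Window in which a truncation may slip in. */
		if (race) {
			race(f);
			race = NULL;	/* fire only once */
		}

		if (seq == f->truncate_count)
			return TOY_MAPPED;
		/* Counter moved: back out and retry. */
	}
}
```

The sketch only catches truncation. An invalidation that leaves i_size and truncate_count untouched passes the recheck unnoticed, which is the gap the following paragraph describes.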
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. Invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 01:46:57 -07:00
|
|
|
|
|
|
|
if (!error || error == AOP_TRUNCATED_PAGE)
|
2005-12-15 14:28:17 -08:00
|
|
|
goto retry_find;
|
2021-01-28 19:19:45 +01:00
|
|
|
filemap_invalidate_unlock_shared(mapping);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2007-07-19 01:47:03 -07:00
|
|
|
return VM_FAULT_SIGBUS;
|
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.
Even if the readahead does something, it could get throttled because of IO
pressure on the system if the process is in a lower-priority cgroup.
Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions. Consider some large company that does a
lot of web traffic. This large company has load balancing logic in its
core web server, because some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trips over processes' mmap_sem for various reasons. Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Instead, rework the filemap fault path to drop the mmap_sem at any point that
we may do IO or block for an extended period of time. This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur. Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.
This patch also adds a new helper for locking the page with the mmap_sem
dropped. This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked. However a forthcoming patchset will change this
with the ability to abort read-ahead bios if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page. This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.
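The drop-and-retry pattern described above can be sketched in user space. This is a minimal model under stated assumptions, not the kernel implementation: `struct toy_mm`, `toy_read_page()`, `toy_filemap_fault()`, `toy_handle_fault()` and `TOY_FAULT_RETRY` are hypothetical names, and a boolean stands in for the mmap_sem.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FAULT_RETRY 0x1	/* hypothetical stand-in for VM_FAULT_RETRY */

struct toy_mm {
	bool lock_held;		/* models mmap_sem held by the faulting task */
	bool page_uptodate;	/* models the page cache page being read in */
	int io_under_lock;	/* reads issued while the lock was held */
};

/* Models blocking IO and records whether the lock was held at the time. */
static void toy_read_page(struct toy_mm *mm)
{
	if (mm->lock_held)
		mm->io_under_lock++;
	mm->page_uptodate = true;
}

/*
 * Fault handler, entered with the lock held. If the page is not yet
 * uptodate, drop the lock before blocking, do the IO, and ask the
 * caller to retry.
 */
static int toy_filemap_fault(struct toy_mm *mm)
{
	assert(mm->lock_held);

	if (mm->page_uptodate)
		return 0;	/* fast path: page is cached, lock stays held */

	mm->lock_held = false;	/* drop the lock before anything may block */
	toy_read_page(mm);
	return TOY_FAULT_RETRY;	/* caller must retake the lock and retry */
}

/* Caller loop: retake the lock and retry for as long as asked to. */
static int toy_handle_fault(struct toy_mm *mm)
{
	int calls = 0;
	int ret;

	do {
		mm->lock_held = true;
		ret = toy_filemap_fault(mm);
		calls++;
	} while (ret & TOY_FAULT_RETRY);

	mm->lock_held = false;
	return calls;
}
```

In this model a cold fault takes two passes and a warm fault takes one, and no read is ever issued while the lock is held.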
[josef@toxicpanda.com: v6]
Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-13 11:44:22 -07:00
|
|
|
|
|
|
|
out_retry:
|
|
|
|
/*
|
2020-06-08 21:33:54 -07:00
|
|
|
* We dropped the mmap_lock, we need to return to the fault handler to
|
2019-03-13 11:44:22 -07:00
|
|
|
* re-find the vma and come back and find our hopefully still populated
|
|
|
|
* page.
|
|
|
|
*/
|
2023-05-06 17:04:14 +01:00
|
|
|
if (!IS_ERR(folio))
|
2021-03-10 10:46:41 -05:00
|
|
|
folio_put(folio);
|
2021-01-28 19:19:45 +01:00
|
|
|
if (mapping_locked)
|
|
|
|
filemap_invalidate_unlock_shared(mapping);
|
2019-03-13 11:44:22 -07:00
|
|
|
if (fpin)
|
|
|
|
fput(fpin);
|
|
|
|
return ret | VM_FAULT_RETRY;
|
2007-07-19 01:46:59 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_fault);
|
|
|
|
|
2023-01-16 19:39:39 +00:00
|
|
|
static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
|
|
|
|
pgoff_t start)
|
2014-04-07 15:37:19 -07:00
|
|
|
{
|
2020-12-19 15:19:23 +03:00
|
|
|
struct mm_struct *mm = vmf->vma->vm_mm;
|
|
|
|
|
|
|
|
/* Huge page is mapped? No need to proceed. */
|
|
|
|
if (pmd_trans_huge(*vmf->pmd)) {
|
2023-01-16 19:39:39 +00:00
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
2020-12-19 15:19:23 +03:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2023-01-16 19:39:39 +00:00
|
|
|
if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
|
|
|
|
struct page *page = folio_file_page(folio, start);
|
mm: filemap: coding style cleanup for filemap_map_pmd()
Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.
When discussing the patch that splits page cache THP in order to offline
the poisoned page, Naoya mentioned there is a bigger problem [1] that
prevents this from working, since the page cache page will be truncated
if uncorrectable errors happen. Looking into this more deeply, it turns out
that this approach (truncating the poisoned page) may incur silent data loss
for all non-readonly filesystems if the page is dirty. It may be worse for
in-memory filesystems, e.g. shmem/tmpfs, since the data blocks are
actually gone.
To solve this problem we could keep the poisoned dirty page in page
cache then notify the users on any later access, e.g. page fault,
read/write, etc. The clean page could be truncated as is since they can
be reread from disk later on.
The consequence is that filesystems may find a poisoned page and manipulate
it as a healthy page, since filesystems do not check whether the page is
poisoned in any of the relevant paths except page fault. In general, we
need to make filesystems aware of poisoned pages before we can keep the
poisoned page in the page cache in order to solve the data loss problem.
To make filesystems aware of poisoned pages we should consider:
- The page should not be written back: clearing the dirty flag could
prevent writeback.
- The page should not be dropped (it shows as a clean page) by drop
caches or other callers: the refcount pin from hwpoison could prevent
invalidation (called by cache drop, inode cache shrinking, etc),
but it doesn't avoid invalidation in the DIO path.
- The page should be able to get truncated/hole punched/unlinked: it
works as it is.
- Notify users when the page is accessed, e.g. read/write, page fault
and other paths (compression, encryption, etc).
The scope of the last one is huge since almost all filesystems need to do
it once a page is returned from page cache lookup. There are a couple
of options to do it:
1. Check hwpoison flag for every path, the most straightforward way.
2. Return NULL for poisoned page from page cache lookup. Most
callsites check if NULL is returned, so this should need the least work,
I think. But the error handling in filesystems just returns -ENOMEM,
and that error code will obviously confuse users.
3. To improve #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
but this will involve a significant amount of code change as well,
since all the paths need to check whether the pointer is an error,
just like option #1.
I did prototypes for both #1 and #3, but it seems #3 may require more
changes than #1. For #3 ERR_PTR will be returned so all the callers
need to check the return value otherwise invalid pointer may be
dereferenced, but not all callers really care about the content of the
page, for example, partial truncate which just sets the truncated range
in one page to 0. So for such paths it needs additional modification if
ERR_PTR is returned. And if the callers have their own way to handle
the problematic pages we need to add a new FGP flag to tell FGP
functions to return the pointer to the page.
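Option #3 above relies on the error-pointer idiom, which can be sketched in user space. This is an illustration under stated assumptions only: the `toy_` helpers below are hypothetical re-implementations written for this sketch (they are not the kernel's definitions), and `struct toy_page`, `toy_lookup()` and `toy_read()` model a page cache lookup and one caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095	/* assumed size of the reserved errno range */

/* Encode a negative errno value in a pointer. */
static inline void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value back out of an error pointer. */
static inline long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* An error pointer lives in the top TOY_MAX_ERRNO addresses. */
static inline bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_page {
	bool hwpoison;
	int data;
};

/* Lookup with three outcomes: missing, poisoned, or a usable page. */
static struct toy_page *toy_lookup(struct toy_page *cache, size_t n, size_t idx)
{
	if (idx >= n)
		return NULL;			/* not in the cache */
	if (cache[idx].hwpoison)
		return toy_err_ptr(-EIO);	/* cached, but poisoned */
	return &cache[idx];
}

/* Every caller now has to tell the three outcomes apart. */
static int toy_read(struct toy_page *cache, size_t n, size_t idx, int *out)
{
	struct toy_page *page = toy_lookup(cache, n, idx);

	assert(out != NULL);
	if (!page)
		return -ENOENT;
	if (toy_is_err(page))
		return (int)toy_ptr_err(page);
	*out = page->data;
	return 0;
}
```

The extra `toy_is_err()` branch in `toy_read()` is the per-caller cost the message refers to: a caller that only tests for NULL would dereference the encoded error value.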
It may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and it is very hard to debug. It seems
this problem was briefly discussed before, but no action was
taken at that time. [2]
As the investigation above shows, it takes a huge amount of work to
solve the potential data loss for all filesystems. But it is much
easier for in-memory filesystems, and such filesystems actually suffer
more than others since even the data blocks are gone due to truncating.
So this patchset starts from shmem/tmpfs by taking option #1.
TODO:
* The unpoison has been broken since commit 0ed950d1f281 ("mm,hwpoison: make
get_hwpoison_page() call get_any_page()"), and this patch series make
refcount check for unpoisoning shmem page fail.
* Expand to other filesystems. But I haven't heard feedback from filesystem
developers yet.
Patch breakdown:
Patch #1: cleanup, depended by patch #2
Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
Patch #3: coding style cleanup
Patch #4: refactor and preparation.
Patch #5: keep the poisoned page in page cache and handle such case for all
the paths.
Patch #6: the previous patches unblock page cache THP split, so this patch
add page cache THP split support.
This patch (of 4):
A minor cleanup to the indent.
Link: https://lkml.kernel.org/r/20211020210755.23964-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20211020210755.23964-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05 13:41:04 -07:00
|
|
|
vm_fault_t ret = do_set_pmd(vmf, page);
|
|
|
|
if (!ret) {
|
|
|
|
/* The page is mapped successfully, reference consumed. */
|
2023-01-16 19:39:39 +00:00
|
|
|
folio_unlock(folio);
|
2021-11-05 13:41:04 -07:00
|
|
|
return true;
|
2020-12-19 15:19:23 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
mm: fix oops when filemap_map_pmd() without prealloc_pte
syzbot reports oops in lockdep's __lock_acquire(), called from
__pte_offset_map_lock() called from filemap_map_pages(); or when I run the
repro, the oops comes in pmd_install(), called from filemap_map_pmd()
called from filemap_map_pages(), just before the __pte_offset_map_lock().
The problem is that filemap_map_pmd() has been assuming that when it finds
pmd_none(), a page table has already been prepared in prealloc_pte; and
indeed do_fault_around() has been careful to preallocate one there, when
it finds pmd_none(): but what if *pmd became none in between?
My 6.6 mods in mm/khugepaged.c, avoiding mmap_lock for write, have made it
easy for *pmd to be cleared while servicing a page fault; but even before
those, a huge *pmd might be zapped while a fault is serviced.
The difference in symptomatic stack traces comes from the "memory model"
in use: pmd_install() uses pmd_populate() uses page_to_pfn(): in some
models that is strict, and will oops on the NULL prealloc_pte; in other
models, it will construct a bogus value to be populated into *pmd, then
__pte_offset_map_lock() oops when trying to access split ptlock pointer
(or some other symptom in normal case of ptlock embedded not pointer).
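The guard this fix introduces can be illustrated with a small user-space model. It is only a sketch under stated assumptions: `struct toy_vmf` and `toy_install_prealloc()` are hypothetical, a NULL `pmd` models pmd_none(), and clearing `prealloc_pte` after installation is how this sketch chooses to model the table being consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the fault state involved. */
struct toy_vmf {
	void *pmd;		/* NULL models pmd_none() */
	void *prealloc_pte;	/* preallocated page table, or NULL */
};

/*
 * Install the preallocated table only if the slot is empty AND a table
 * was actually preallocated. Returns true if a table was installed.
 */
static bool toy_install_prealloc(struct toy_vmf *vmf)
{
	assert(vmf != NULL);

	if (vmf->pmd == NULL && vmf->prealloc_pte != NULL) {
		vmf->pmd = vmf->prealloc_pte;
		vmf->prealloc_pte = NULL;	/* the table has been consumed */
		return true;
	}
	return false;
}
```

Without the second half of the condition, the case where the slot became empty after the preallocation decision was made would install a NULL table, which corresponds to the oops described above.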
Link: https://lore.kernel.org/linux-mm/20231115065506.19780-1-jose.pekkarinen@foxhound.fi/
Link: https://lkml.kernel.org/r/6ed0c50c-78ef-0719-b3c5-60c0c010431c@google.com
Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-and-tested-by: syzbot+89edd67979b52675ddec@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/0000000000005e44550608a0806c@google.com/
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: José Pekkarinen <jose.pekkarinen@foxhound.fi>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-17 00:49:18 -08:00
|
|
|
if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
|
2021-11-05 13:38:38 -07:00
|
|
|
pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
|
2020-12-19 15:19:23 +03:00
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2023-08-02 16:14:02 +01:00
|
|
|
static struct folio *next_uptodate_folio(struct xa_state *xas,
|
|
|
|
struct address_space *mapping, pgoff_t end_pgoff)
|
2020-12-19 15:19:23 +03:00
|
|
|
{
|
2023-08-02 16:14:02 +01:00
|
|
|
struct folio *folio = xas_next_entry(xas, end_pgoff);
|
2020-12-19 15:19:23 +03:00
|
|
|
unsigned long max_idx;
|
|
|
|
|
|
|
|
do {
|
2021-03-12 23:33:43 -05:00
|
|
|
if (!folio)
|
2020-12-19 15:19:23 +03:00
|
|
|
return NULL;
|
2021-03-12 23:33:43 -05:00
|
|
|
if (xas_retry(xas, folio))
|
2020-12-19 15:19:23 +03:00
|
|
|
continue;
|
2021-03-12 23:33:43 -05:00
|
|
|
if (xa_is_value(folio))
|
2020-12-19 15:19:23 +03:00
|
|
|
continue;
|
2021-03-12 23:33:43 -05:00
|
|
|
if (folio_test_locked(folio))
|
2020-12-19 15:19:23 +03:00
|
|
|
continue;
|
2024-06-25 13:53:50 -07:00
|
|
|
if (!folio_try_get(folio))
|
2020-12-19 15:19:23 +03:00
|
|
|
continue;
|
|
|
|
/* Has the page moved or been split? */
|
2021-03-12 23:33:43 -05:00
|
|
|
if (unlikely(folio != xas_reload(xas)))
|
2020-12-19 15:19:23 +03:00
|
|
|
goto skip;
|
2021-03-12 23:33:43 -05:00
|
|
|
if (!folio_test_uptodate(folio) || folio_test_readahead(folio))
|
2020-12-19 15:19:23 +03:00
|
|
|
goto skip;
|
2021-03-12 23:33:43 -05:00
|
|
|
if (!folio_trylock(folio))
|
2020-12-19 15:19:23 +03:00
|
|
|
goto skip;
|
2021-03-12 23:33:43 -05:00
|
|
|
if (folio->mapping != mapping)
|
2020-12-19 15:19:23 +03:00
|
|
|
goto unlock;
|
2021-03-12 23:33:43 -05:00
|
|
|
if (!folio_test_uptodate(folio))
|
2020-12-19 15:19:23 +03:00
|
|
|
goto unlock;
|
|
|
|
max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
|
|
|
|
if (xas->xa_index >= max_idx)
|
|
|
|
goto unlock;
|
2021-03-12 23:46:45 -05:00
|
|
|
return folio;
|
2020-12-19 15:19:23 +03:00
|
|
|
unlock:
|
2021-03-12 23:33:43 -05:00
|
|
|
folio_unlock(folio);
|
2020-12-19 15:19:23 +03:00
|
|
|
skip:
|
2021-03-12 23:33:43 -05:00
|
|
|
folio_put(folio);
|
|
|
|
} while ((folio = xas_next_entry(xas, end_pgoff)) != NULL);
|
2020-12-19 15:19:23 +03:00
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
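The end-of-file bound used in next_uptodate_folio() above is plain rounding-up arithmetic, shown here as a small worked sketch. The names `TOY_PAGE_SIZE`, `TOY_DIV_ROUND_UP`, `toy_max_idx()` and `toy_index_in_file()` are hypothetical, and a 4096-byte page is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096ULL	/* assumed page size for the example */
#define TOY_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of pages needed to hold i_size bytes. */
static unsigned long toy_max_idx(unsigned long long i_size)
{
	return (unsigned long)TOY_DIV_ROUND_UP(i_size, TOY_PAGE_SIZE);
}

/* A page index is inside the file iff it is below that page count. */
static bool toy_index_in_file(unsigned long index, unsigned long long i_size)
{
	return index < toy_max_idx(i_size);
}
```

With these assumptions a 4097-byte file spans two pages, so indices 0 and 1 are inside the file and index 2 is rejected, while an empty file has no valid index at all.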
|
|
|
|
|
2023-08-02 16:14:02 +01:00
|
|
|
/*
|
|
|
|
* Map page range [start_page, start_page + nr_pages) of folio.
|
|
|
|
* start_page is gotten from start by folio_page(folio, start)
|
|
|
|
*/
|
|
|
|
static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
|
|
|
|
struct folio *folio, unsigned long start,
|
2023-09-14 21:47:41 +08:00
|
|
|
unsigned long addr, unsigned int nr_pages,
|
2024-04-12 14:47:51 +08:00
|
|
|
unsigned long *rss, unsigned int *mmap_miss)
|
2020-12-19 15:19:23 +03:00
|
|
|
{
|
2023-08-02 16:14:02 +01:00
|
|
|
vm_fault_t ret = 0;
|
|
|
|
struct page *page = folio_page(folio, start);
|
2023-08-02 16:14:05 +01:00
|
|
|
unsigned int count = 0;
|
|
|
|
pte_t *old_ptep = vmf->pte;
|
2020-12-19 15:19:23 +03:00
|
|
|
|
2023-08-02 16:14:02 +01:00
|
|
|
do {
|
2023-08-02 16:14:05 +01:00
|
|
|
if (PageHWPoison(page + count))
|
|
|
|
goto skip;
|
2023-08-02 16:14:02 +01:00
|
|
|
|
2024-03-22 17:35:55 +08:00
|
|
|
/*
|
|
|
|
* If there are too many folios that are recently evicted
|
|
|
|
* in a file, they will probably continue to be evicted.
|
|
|
|
* In such situation, read-ahead is only a waste of IO.
|
|
|
|
* Don't decrease mmap_miss in this scenario to make sure
|
|
|
|
* we can stop read-ahead.
|
|
|
|
*/
|
|
|
|
if (!folio_test_workingset(folio))
|
|
|
|
(*mmap_miss)++;
|
2023-08-02 16:14:02 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* NOTE: If there're PTE markers, we'll leave them to be
|
|
|
|
* handled in the specific fault path, and it'll prohibit the
|
|
|
|
* fault-around logic.
|
|
|
|
*/
|
2023-11-14 15:49:45 +00:00
|
|
|
if (!pte_none(ptep_get(&vmf->pte[count])))
|
2023-08-02 16:14:05 +01:00
|
|
|
goto skip;
|
2023-08-02 16:14:02 +01:00
|
|
|
|
2023-08-02 16:14:05 +01:00
|
|
|
count++;
|
|
|
|
continue;
|
|
|
|
skip:
|
|
|
|
if (count) {
|
|
|
|
set_pte_range(vmf, folio, page, count, addr);
|
2024-04-12 14:47:51 +08:00
|
|
|
*rss += count;
|
2023-08-02 16:14:05 +01:00
|
|
|
folio_ref_add(folio, count);
|
2023-09-20 04:53:35 +01:00
|
|
|
if (in_range(vmf->address, addr, count * PAGE_SIZE))
|
2023-08-02 16:14:05 +01:00
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
}
|
2023-08-02 16:14:02 +01:00
|
|
|
|
2023-08-02 16:14:05 +01:00
|
|
|
count++;
|
|
|
|
page += count;
|
|
|
|
vmf->pte += count;
|
|
|
|
addr += count * PAGE_SIZE;
|
|
|
|
count = 0;
|
|
|
|
} while (--nr_pages > 0);
|
|
|
|
|
|
|
|
if (count) {
|
|
|
|
set_pte_range(vmf, folio, page, count, addr);
|
2024-04-12 14:47:51 +08:00
|
|
|
*rss += count;
|
2023-08-02 16:14:05 +01:00
|
|
|
folio_ref_add(folio, count);
|
2023-09-20 04:53:35 +01:00
|
|
|
if (in_range(vmf->address, addr, count * PAGE_SIZE))
|
2023-08-02 16:14:05 +01:00
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
}
|
2023-08-02 16:14:02 +01:00
|
|
|
|
2023-08-02 16:14:05 +01:00
|
|
|
vmf->pte = old_ptep;
|
2023-09-14 21:47:41 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
|
|
|
|
struct folio *folio, unsigned long addr,
|
2024-04-12 14:47:51 +08:00
|
|
|
unsigned long *rss, unsigned int *mmap_miss)
|
2023-09-14 21:47:41 +08:00
|
|
|
{
|
|
|
|
vm_fault_t ret = 0;
|
|
|
|
struct page *page = &folio->page;
|
|
|
|
|
|
|
|
if (PageHWPoison(page))
|
|
|
|
return ret;
|
|
|
|
|
2024-03-22 17:35:55 +08:00
|
|
|
/* See comment of filemap_map_folio_range() */
|
|
|
|
if (!folio_test_workingset(folio))
|
|
|
|
(*mmap_miss)++;
|
2023-09-14 21:47:41 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* NOTE: If there're PTE markers, we'll leave them to be
|
|
|
|
* handled in the specific fault path, and it'll prohibit
|
|
|
|
* the fault-around logic.
|
|
|
|
*/
|
|
|
|
if (!pte_none(ptep_get(vmf->pte)))
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (vmf->address == addr)
|
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
|
|
|
|
set_pte_range(vmf, folio, page, 1, addr);
|
2024-04-12 14:47:51 +08:00
|
|
|
(*rss)++;
|
2023-09-14 21:47:41 +08:00
|
|
|
folio_ref_inc(folio);
|
2023-08-02 16:14:02 +01:00
|
|
|
|
|
|
|
return ret;
|
2020-12-19 15:19:23 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
vm_fault_t filemap_map_pages(struct vm_fault *vmf,
|
|
|
|
pgoff_t start_pgoff, pgoff_t end_pgoff)
|
|
|
|
{
|
|
|
|
struct vm_area_struct *vma = vmf->vma;
|
|
|
|
struct file *file = vma->vm_file;
|
2014-04-07 15:37:19 -07:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
2024-08-22 15:50:13 +02:00
|
|
|
pgoff_t file_end, last_pgoff = start_pgoff;
|
2021-01-14 15:24:19 +00:00
|
|
|
unsigned long addr;
|
2018-05-17 00:08:30 -04:00
|
|
|
XA_STATE(xas, &mapping->i_pages, start_pgoff);
|
2021-03-12 23:46:45 -05:00
|
|
|
struct folio *folio;
|
2020-12-19 15:19:23 +03:00
|
|
|
vm_fault_t ret = 0;
|
2024-04-12 14:47:51 +08:00
|
|
|
unsigned long rss = 0;
|
|
|
|
unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
|
2014-04-07 15:37:19 -07:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2023-08-02 16:14:02 +01:00
|
|
|
folio = next_uptodate_folio(&xas, mapping, end_pgoff);
|
2021-03-12 23:46:45 -05:00
|
|
|
if (!folio)
|
2020-12-19 15:19:23 +03:00
|
|
|
goto out;
|
2014-04-07 15:37:19 -07:00
|
|
|
|
2023-01-16 19:39:39 +00:00
|
|
|
if (filemap_map_pmd(vmf, folio, start_pgoff)) {
|
2020-12-19 15:19:23 +03:00
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
goto out;
|
|
|
|
}
|
2014-04-07 15:37:19 -07:00
|
|
|
|
2021-01-14 15:24:19 +00:00
|
|
|
addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
|
|
|
|
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
|
2023-06-08 18:11:29 -07:00
|
|
|
if (!vmf->pte) {
|
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
|
|
|
goto out;
|
|
|
|
}
|
2024-04-12 14:47:51 +08:00
|
|
|
|
2024-08-22 15:50:13 +02:00
|
|
|
file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
|
|
|
|
if (end_pgoff > file_end)
|
|
|
|
end_pgoff = file_end;
|
|
|
|
|
2024-04-12 14:47:51 +08:00
|
|
|
folio_type = mm_counter_file(folio);
|
2020-12-19 15:19:23 +03:00
|
|
|
do {
|
2023-08-02 16:14:02 +01:00
|
|
|
unsigned long end;
|
2016-07-26 15:25:23 -07:00
|
|
|
|
2021-01-14 15:24:19 +00:00
|
|
|
addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
|
2020-12-19 15:19:23 +03:00
|
|
|
vmf->pte += xas.xa_index - last_pgoff;
|
2018-05-17 00:08:30 -04:00
|
|
|
last_pgoff = xas.xa_index;
|
2023-09-21 16:15:35 +08:00
|
|
|
end = folio_next_index(folio) - 1;
|
2023-08-02 16:14:02 +01:00
|
|
|
nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
|
2020-12-19 15:19:23 +03:00
|
|
|
|
2023-09-14 21:47:41 +08:00
|
|
|
if (!folio_test_large(folio))
|
|
|
|
ret |= filemap_map_order0_folio(vmf,
|
2024-04-12 14:47:51 +08:00
|
|
|
folio, addr, &rss, &mmap_miss);
|
2023-09-14 21:47:41 +08:00
|
|
|
else
|
|
|
|
ret |= filemap_map_folio_range(vmf, folio,
|
|
|
|
xas.xa_index - folio->index, addr,
|
2024-04-12 14:47:51 +08:00
|
|
|
nr_pages, &rss, &mmap_miss);
|
2020-11-24 18:48:26 +00:00
|
|
|
|
2021-03-12 23:46:45 -05:00
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
2023-09-14 21:47:41 +08:00
|
|
|
} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
|
2024-04-12 14:47:51 +08:00
|
|
|
add_mm_counter(vma->vm_mm, folio_type, rss);
|
2020-12-19 15:19:23 +03:00
|
|
|
pte_unmap_unlock(vmf->pte, vmf->ptl);
|
filemap: add trace events for get_pages, map_pages, and fault
To allow precise tracking of page caches accessed, add new tracepoints
that trigger when a process actually accesses them.
The ureadahead program used by ChromeOS traces the disk access of programs
as they start up at boot up. It uses mincore(2) or the
'mm_filemap_add_to_page_cache' trace event to accomplish this. It stores
this information in a "pack" file and on subsequent boots, it will read
the pack file and call readahead(2) on the information so that disk
storage can be loaded into RAM before the applications actually need it.
A problem we see is that the kernel's readahead algorithm can
aggressively pull in more data than needed (to try and accomplish the same
goal), and this data is also recorded. The end result is that the pack
file contains a lot of pages on disk that are never actually used.
Calling readahead(2) on these unused pages can slow down the system boot
up times.
To solve this, add 3 new trace events, get_pages, map_pages, and fault.
These will be used to trace the pages that are not only pulled in from disk,
but are actually used by the application. Only those pages will be stored
in the pack file, and this helps out the performance of boot up.
With the combination of these 3 new trace events and
mm_filemap_add_to_page_cache, we observed a reduction in the pack file by
7.3% - 20% on ChromeOS varying by device.
Link: https://lkml.kernel.org/r/20240813100312.3930505-1-takayas@chromium.org
Signed-off-by: Takaya Saeki <takayas@chromium.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Junichi Uekawa <uekawa@chromium.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-13 10:03:12 +00:00
|
|
|
trace_mm_filemap_map_pages(mapping, start_pgoff, end_pgoff);
|
2020-12-19 15:19:23 +03:00
|
|
|
out:
|
2014-04-07 15:37:19 -07:00
|
|
|
rcu_read_unlock();
|
2023-09-14 21:47:41 +08:00
|
|
|
|
|
|
|
mmap_miss_saved = READ_ONCE(file->f_ra.mmap_miss);
|
|
|
|
if (mmap_miss >= mmap_miss_saved)
|
|
|
|
WRITE_ONCE(file->f_ra.mmap_miss, 0);
|
|
|
|
else
|
|
|
|
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss_saved - mmap_miss);
|
|
|
|
|
2020-12-19 15:19:23 +03:00
|
|
|
return ret;
|
2014-04-07 15:37:19 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(filemap_map_pages);
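The index and counter arithmetic used by filemap_map_pages() above can be tried outside the kernel. What follows is a minimal userspace sketch, not kernel code: SKETCH_PAGE_SHIFT, sketch_file_end(), sketch_fault_addr() and sketch_mmap_miss() are hypothetical names introduced for illustration, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel takes PAGE_SHIFT from the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/*
 * Index of the last page backed by a file of i_size bytes, mirroring
 * DIV_ROUND_UP(i_size, PAGE_SIZE) - 1. Assumes i_size > 0.
 */
static inline uint64_t sketch_file_end(uint64_t i_size)
{
	return (i_size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE - 1;
}

/*
 * User address of file page pgoff in a mapping that starts at vm_start
 * and whose first page is file page vm_pgoff.
 */
static inline uint64_t sketch_fault_addr(uint64_t vm_start, uint64_t vm_pgoff,
					 uint64_t pgoff)
{
	return vm_start + ((pgoff - vm_pgoff) << SKETCH_PAGE_SHIFT);
}

/*
 * New per-file miss counter: subtract the count accumulated during this
 * call from the saved value, clamping at zero instead of wrapping.
 */
static inline unsigned int sketch_mmap_miss(unsigned int saved,
					    unsigned int counted)
{
	return counted >= saved ? 0 : saved - counted;
}
```

For example, a 4097-byte file has a last page index of 1, so an end_pgoff beyond that would be clamped to 1 before the mapping loop runs.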
|
|
|
|
|
2018-06-07 17:08:00 -07:00
|
|
|
vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
|
2012-06-12 16:20:29 +02:00
|
|
|
{
|
2020-11-16 14:33:37 +01:00
|
|
|
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
|
2021-03-12 23:57:44 -05:00
|
|
|
struct folio *folio = page_folio(vmf->page);
|
2018-06-07 17:08:00 -07:00
|
|
|
vm_fault_t ret = VM_FAULT_LOCKED;
|
2012-06-12 16:20:29 +02:00
|
|
|
|
2020-11-16 14:33:37 +01:00
|
|
|
sb_start_pagefault(mapping->host->i_sb);
|
2017-02-24 14:56:41 -08:00
|
|
|
file_update_time(vmf->vma->vm_file);
|
2021-03-12 23:57:44 -05:00
|
|
|
folio_lock(folio);
|
|
|
|
if (folio->mapping != mapping) {
|
|
|
|
folio_unlock(folio);
|
2012-06-12 16:20:29 +02:00
|
|
|
ret = VM_FAULT_NOPAGE;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-06-12 16:20:37 +02:00
|
|
|
/*
|
2021-03-12 23:57:44 -05:00
|
|
|
* We mark the folio dirty already here so that when freeze is in
|
2012-06-12 16:20:37 +02:00
|
|
|
* progress, we are guaranteed that writeback during freezing will
|
2021-03-12 23:57:44 -05:00
|
|
|
* see the dirty folio and writeprotect it again.
|
2012-06-12 16:20:37 +02:00
|
|
|
*/
|
2021-03-12 23:57:44 -05:00
|
|
|
folio_mark_dirty(folio);
|
|
|
|
folio_wait_stable(folio);
|
2012-06-12 16:20:29 +02:00
|
|
|
out:
|
2020-11-16 14:33:37 +01:00
|
|
|
sb_end_pagefault(mapping->host->i_sb);
|
2012-06-12 16:20:29 +02:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-09-27 22:29:37 +04:00
|
|
|
const struct vm_operations_struct generic_file_vm_ops = {
|
2007-07-19 01:46:59 -07:00
|
|
|
.fault = filemap_fault,
|
2014-04-07 15:37:19 -07:00
|
|
|
.map_pages = filemap_map_pages,
|
2012-06-12 16:20:29 +02:00
|
|
|
.page_mkwrite = filemap_page_mkwrite,
|
2005-04-16 15:20:36 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
/* This is used for a general mmap of a disk file */
|
|
|
|
|
2021-05-04 18:40:12 -07:00
|
|
|
int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
|
2022-04-29 11:53:28 -04:00
|
|
|
if (!mapping->a_ops->read_folio)
|
2005-04-16 15:20:36 -07:00
|
|
|
return -ENOEXEC;
|
|
|
|
file_accessed(file);
|
|
|
|
vma->vm_ops = &generic_file_vm_ops;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is for filesystems which do not implement ->writepage.
|
|
|
|
*/
|
|
|
|
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
|
|
|
|
{
|
mm: drop the assumption that VM_SHARED always implies writable
Patch series "permit write-sealed memfd read-only shared mappings", v4.
The man page for fcntl() describing memfd file seals states the following
about F_SEAL_WRITE:-
Furthermore, trying to create new shared, writable memory-mappings via
mmap(2) will also fail with EPERM.
With emphasis on 'writable'. It turns out in fact that currently the
kernel simply disallows all new shared memory mappings for a memfd with
F_SEAL_WRITE applied, rendering this documentation inaccurate.
This matters because users are therefore unable to obtain a shared mapping
to a memfd after write sealing altogether, which limits their usefulness.
This was reported in the discussion thread [1] originating from a bug
report [2].
This is a product of both using the struct address_space->i_mmap_writable
atomic counter to determine whether writing may be permitted, and the
kernel adjusting this counter when any VM_SHARED mapping is performed and
more generally implicitly assuming VM_SHARED implies writable.
It seems sensible that we should only update this mapping if VM_MAYWRITE
is specified, i.e. whether it is possible that this mapping could at any
point be written to.
If we do so then all we need to do to permit write seals to function as
documented is to clear VM_MAYWRITE when mapping read-only. It turns out
this functionality already exists for F_SEAL_FUTURE_WRITE - we can
therefore simply adapt this logic to do the same for F_SEAL_WRITE.
We then hit a chicken and egg situation in mmap_region() where the check
for VM_MAYWRITE occurs before we are able to clear this flag. To work
around this, perform this check after we invoke call_mmap(), with careful
consideration of error paths.
Thanks to Andy Lutomirski for the suggestion!
[1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
[2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
This patch (of 3):
There is a general assumption that VMAs with the VM_SHARED flag set are
writable. If the VM_MAYWRITE flag is not set, then this is simply not the
case.
Update those checks which affect the struct address_space->i_mmap_writable
field to explicitly test for this by introducing
[vma_]is_shared_maywrite() helper functions.
This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
that the VMA cannot be written to.
Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-12 18:04:28 +01:00
|
|
|
if (vma_is_shared_maywrite(vma))
|
2005-04-16 15:20:36 -07:00
|
|
|
return -EINVAL;
|
|
|
|
return generic_file_mmap(file, vma);
|
|
|
|
}
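The vma_is_shared_maywrite() check used above only rejects mappings that are both shared and potentially writable. A minimal userspace sketch of that flag test follows. The SKETCH_* constants and sketch_is_shared_maywrite() are hypothetical names, and the flag values are assumptions chosen for illustration rather than taken from this file.

```c
#include <stdbool.h>

/* Assumed flag values for illustration only. */
#define SKETCH_VM_SHARED   0x08UL
#define SKETCH_VM_MAYWRITE 0x20UL

/* True only when both the shared bit and the may-write bit are set. */
static inline bool sketch_is_shared_maywrite(unsigned long vm_flags)
{
	const unsigned long both = SKETCH_VM_SHARED | SKETCH_VM_MAYWRITE;

	return (vm_flags & both) == both;
}
```

Under this test a shared mapping with the may-write bit cleared is still accepted, which is what allows a read-only shared mapping to go through generic_file_readonly_mmap().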
|
|
|
|
#else
|
2018-10-26 15:04:03 -07:00
|
|
|
vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf)
|
2018-04-13 15:35:27 -07:00
|
|
|
{
|
2018-10-26 15:04:03 -07:00
|
|
|
return VM_FAULT_SIGBUS;
|
2018-04-13 15:35:27 -07:00
|
|
|
}
|
2021-05-04 18:40:12 -07:00
|
|
|
int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
2021-05-04 18:40:12 -07:00
|
|
|
int generic_file_readonly_mmap(struct file *file, struct vm_area_struct *vma)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_MMU */
|
|
|
|
|
2018-04-13 15:35:27 -07:00
|
|
|
EXPORT_SYMBOL(filemap_page_mkwrite);
|
2005-04-16 15:20:36 -07:00
|
|
|
EXPORT_SYMBOL(generic_file_mmap);
|
|
|
|
EXPORT_SYMBOL(generic_file_readonly_mmap);
|
|
|
|
|
2020-12-16 11:45:30 -05:00
|
|
|
static struct folio *do_read_cache_folio(struct address_space *mapping,
|
2022-05-01 21:39:29 -04:00
|
|
|
pgoff_t index, filler_t filler, struct file *file, gfp_t gfp)
|
2014-04-03 14:48:18 -07:00
|
|
|
{
|
2020-12-16 11:45:30 -05:00
|
|
|
struct folio *folio;
|
2005-04-16 15:20:36 -07:00
|
|
|
int err;
|
2022-05-08 15:07:11 -04:00
|
|
|
|
|
|
|
if (!filler)
|
|
|
|
filler = mapping->a_ops->read_folio;
|
2005-04-16 15:20:36 -07:00
|
|
|
repeat:
|
2020-12-16 11:45:30 -05:00
|
|
|
folio = filemap_get_folio(mapping, index);
|
2023-03-07 15:34:10 +01:00
|
|
|
if (IS_ERR(folio)) {
|
2024-08-22 15:50:10 +02:00
|
|
|
folio = filemap_alloc_folio(gfp,
|
|
|
|
mapping_min_folio_order(mapping));
|
2020-12-16 11:45:30 -05:00
|
|
|
if (!folio)
|
2007-10-16 01:24:57 -07:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2024-08-22 15:50:10 +02:00
|
|
|
index = mapping_align_index(mapping, index);
|
2020-12-16 11:45:30 -05:00
|
|
|
err = filemap_add_folio(mapping, folio, index, gfp);
|
2007-10-16 01:24:57 -07:00
|
|
|
if (unlikely(err)) {
|
2020-12-16 11:45:30 -05:00
|
|
|
folio_put(folio);
|
2007-10-16 01:24:57 -07:00
|
|
|
if (err == -EEXIST)
|
|
|
|
goto repeat;
|
2017-12-04 04:02:00 -05:00
|
|
|
/* Presumably ENOMEM for xarray node */
|
2005-04-16 15:20:36 -07:00
|
|
|
return ERR_PTR(err);
|
|
|
|
}
|
2016-03-15 14:55:36 -07:00
|
|
|
|
2022-05-12 17:12:21 -04:00
|
|
|
goto filler;
|
2016-03-15 14:55:36 -07:00
|
|
|
}
|
2020-12-16 11:45:30 -05:00
|
|
|
if (folio_test_uptodate(folio))
|
2005-04-16 15:20:36 -07:00
|
|
|
goto out;
|
|
|
|
|
2021-12-23 15:17:28 -05:00
|
|
|
if (!folio_trylock(folio)) {
|
|
|
|
folio_put_wait_locked(folio, TASK_UNINTERRUPTIBLE);
|
|
|
|
goto repeat;
|
|
|
|
}
|
2016-03-15 14:55:39 -07:00
|
|
|
|
2021-12-23 15:17:28 -05:00
|
|
|
/* Folio was truncated from mapping */
|
2020-12-16 11:45:30 -05:00
|
|
|
if (!folio->mapping) {
|
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
2016-03-15 14:55:36 -07:00
|
|
|
goto repeat;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2016-03-15 14:55:39 -07:00
|
|
|
|
|
|
|
/* Someone else locked and filled the page in a very small window */
|
2020-12-16 11:45:30 -05:00
|
|
|
if (folio_test_uptodate(folio)) {
|
|
|
|
folio_unlock(folio);
|
2005-04-16 15:20:36 -07:00
|
|
|
goto out;
|
|
|
|
}
|
mm/filemap.c: clear page error before actual read
A mount failure happened in the following scenario: an application forked dozens
of threads to mount the same number of cramfs images separately in docker,
but several mounts failed with high probability. The mounts failed because the
page (read from the superblock of the loop dev) was not
uptodate after wait_on_page_locked(page) returned in function cramfs_read:
wait_on_page_locked(page);
if (!PageUptodate(page)) {
...
}
The reason of the checking result of the page not uptodate: systemd-udevd
read the loopX dev before mount, because the status of loopX is Lo_unbound
at this time, so loop_make_request directly trigger the calling of io_end
handler end_buffer_async_read, which called SetPageError(page). So It
caused the page can't be set to uptodate in function
end_buffer_async_read:
if(page_uptodate && !PageError(page)) {
SetPageUptodate(page);
}
Then mount operation is performed, it used the same page which is just
accessed by systemd-udevd above, Because this page is not uptodate, it
will launch a actual read via submit_bh, then wait on this page by calling
wait_on_page_locked(page). When the I/O of the page done, io_end handler
end_buffer_async_read is called, because no one cleared the page
error(during the whole read path of mount), which is caused by
systemd-udevd reading, so this page is still in "PageError" status, which
can't be set to uptodate in function end_buffer_async_read, then caused
mount failure.
But sometimes the mount succeeded even though systemd-udevd read the loopX dev
just before. The reason is that systemd-udevd launched another loopX read
between steps 3.1 and 3.2. The steps are as below:
1, loopX dev default status is Lo_unbound;
2, systemd-udevd read loopX dev (page is set to PageError);
3, mount operation
1) set loopX status to Lo_bound;
==>systemd-udevd read loopX dev<==
2) read loopX dev(page has no error)
3) mount succeed
As the loopX dev status is set to Lo_bound after step 3.1, so the other
loopX dev read by systemd-udevd will go through the whole I/O stack, part
of the call trace as below:
SYS_read
vfs_read
do_sync_read
blkdev_aio_read
generic_file_aio_read
do_generic_file_read:
ClearPageError(page);
mapping->a_ops->readpage(filp, page);
here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel,
some function name changed, the call trace as below:
blkdev_read_iter
generic_file_read_iter
generic_file_buffered_read:
/*
* A previous I/O error may have been due to temporary
* failures, eg. multipath errors.
* PG_error will be set again if readpage fails.
*/
ClearPageError(page);
/* Start the actual read. The read will unlock the page*/
error=mapping->a_ops->readpage(filp, page);
We can see ClearPageError(page) is called before the actual read,
then the read in step 3.2 succeed.
This patch is to add the calling of ClearPageError just before the actual
read of read path of cramfs mount. Without the patch, the call trace as
below when performing cramfs mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page
do_read_cache_page:
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, the call trace as below when performing mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page:
do_read_cache_page:
ClearPageError(page); <== new add
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, mount operation trigger the calling of
ClearPageError(page) before the actual read, the page has no error if no
additional page error happen when I/O done.
Signed-off-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: <yubin@h3c.com>
Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-01 21:04:47 -07:00
|
|
|
|
2022-05-12 17:12:21 -04:00
|
|
|
filler:
|
2022-05-12 17:37:01 -04:00
|
|
|
err = filemap_read_folio(file, filler, folio);
|
2022-05-12 17:47:06 -04:00
|
|
|
if (err) {
|
2022-05-12 17:12:21 -04:00
|
|
|
folio_put(folio);
|
2022-05-12 17:47:06 -04:00
|
|
|
if (err == AOP_TRUNCATED_PAGE)
|
|
|
|
goto repeat;
|
2022-05-12 17:12:21 -04:00
|
|
|
return ERR_PTR(err);
|
|
|
|
}
|
2016-03-15 14:55:36 -07:00
|
|
|
|
2007-05-09 13:42:20 +01:00
|
|
|
out:
|
2020-12-16 11:45:30 -05:00
|
|
|
folio_mark_accessed(folio);
|
|
|
|
return folio;
|
2007-05-06 14:49:04 -07:00
|
|
|
}
|
2010-01-27 09:20:03 -08:00
|
|
|
|
|
|
|
/**
|
2022-05-01 21:39:29 -04:00
|
|
|
* read_cache_folio - Read into page cache, fill it if needed.
|
|
|
|
* @mapping: The address_space to read from.
|
|
|
|
* @index: The index to read.
|
|
|
|
* @filler: Function to perform the read, or NULL to use aops->read_folio().
|
|
|
|
* @file: Passed to filler function, may be NULL if not required.
|
2010-01-27 09:20:03 -08:00
|
|
|
*
|
2022-05-01 21:39:29 -04:00
|
|
|
* Read one page into the page cache. If it succeeds, the folio returned
|
|
|
|
* will contain @index, but it may not be the first page of the folio.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
2022-05-01 21:39:29 -04:00
|
|
|
* If the filler function returns an error, it will be returned to the
|
|
|
|
* caller.
|
2021-01-28 19:19:45 +01:00
|
|
|
*
|
2022-05-01 21:39:29 -04:00
|
|
|
* Context: May sleep. Expects mapping->invalidate_lock to be held.
|
|
|
|
* Return: An uptodate folio on success, ERR_PTR() on failure.
|
2010-01-27 09:20:03 -08:00
|
|
|
*/
|
2020-12-16 11:45:30 -05:00
|
|
|
struct folio *read_cache_folio(struct address_space *mapping, pgoff_t index,
|
2022-05-01 21:39:29 -04:00
|
|
|
filler_t filler, struct file *file)
|
2020-12-16 11:45:30 -05:00
|
|
|
{
|
2022-05-01 21:39:29 -04:00
|
|
|
return do_read_cache_folio(mapping, index, filler, file,
|
2020-12-16 11:45:30 -05:00
|
|
|
mapping_gfp_mask(mapping));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(read_cache_folio);
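read_cache_folio() reports failure through an ERR_PTR()-encoded pointer rather than NULL, so callers are expected to test the result with IS_ERR(). A minimal userspace sketch of that encoding follows. The sketch_* names are hypothetical stand-ins for the kernel helpers, sketch_read() is an invented caller-side example, and the 4095 limit is assumed to match the kernel's MAX_ERRNO.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: the largest errno value that can be encoded in a pointer. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the negative errno value from an encoded pointer. */
static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Encoded errors occupy the top SKETCH_MAX_ERRNO addresses. */
static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static int sketch_object;

/* Hypothetical reader: returns a valid object, or an encoded -ENOMEM. */
static inline void *sketch_read(int fail)
{
	return fail ? sketch_err_ptr(-ENOMEM) : (void *)&sketch_object;
}
```

A caller would check sketch_is_err() on the result first and only then either use the object or pass sketch_ptr_err() back up as the error code. A NULL pointer is not an encoded error under this scheme.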
|
|
|
|
|
2023-02-06 16:25:19 +00:00
|
|
|
/**
|
|
|
|
* mapping_read_folio_gfp - Read into page cache, using specified allocation flags.
|
|
|
|
* @mapping: The address_space for the folio.
|
|
|
|
* @index: The index that the allocated folio will contain.
|
|
|
|
* @gfp: The page allocator flags to use if allocating.
|
|
|
|
*
|
|
|
|
* This is the same as "read_cache_folio(mapping, index, NULL, NULL)", but with
|
|
|
|
* any new memory allocations done using the specified allocation flags.
|
|
|
|
*
|
|
|
|
* The most likely error from this function is EIO, but ENOMEM is
|
|
|
|
* possible and so is EINTR. If ->read_folio returns another error,
|
|
|
|
* that will be returned to the caller.
|
|
|
|
*
|
|
|
|
* The function expects mapping->invalidate_lock to be already held.
|
|
|
|
*
|
|
|
|
* Return: Uptodate folio on success, ERR_PTR() on failure.
|
|
|
|
*/
|
|
|
|
struct folio *mapping_read_folio_gfp(struct address_space *mapping,
|
|
|
|
pgoff_t index, gfp_t gfp)
|
|
|
|
{
|
|
|
|
return do_read_cache_folio(mapping, index, NULL, NULL, gfp);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(mapping_read_folio_gfp);
|
|
|
|
|
2020-12-16 11:45:30 -05:00
|
|
|
static struct page *do_read_cache_page(struct address_space *mapping,
|
2022-05-01 21:39:29 -04:00
|
|
|
pgoff_t index, filler_t *filler, struct file *file, gfp_t gfp)
|
2020-12-16 11:45:30 -05:00
|
|
|
{
|
|
|
|
struct folio *folio;
|
|
|
|
|
2022-05-01 21:39:29 -04:00
|
|
|
folio = do_read_cache_folio(mapping, index, filler, file, gfp);
|
2020-12-16 11:45:30 -05:00
|
|
|
if (IS_ERR(folio))
|
|
|
|
return &folio->page;
|
|
|
|
return folio_file_page(folio, index);
|
|
|
|
}
|
|
|
|
|
2014-04-03 14:48:18 -07:00
|
|
|
struct page *read_cache_page(struct address_space *mapping,
|
2022-05-01 21:39:29 -04:00
|
|
|
pgoff_t index, filler_t *filler, struct file *file)
|
2010-01-27 09:20:03 -08:00
|
|
|
{
|
2022-05-01 21:39:29 -04:00
|
|
|
return do_read_cache_page(mapping, index, filler, file,
|
2019-07-11 20:55:17 -07:00
|
|
|
mapping_gfp_mask(mapping));
|
2010-01-27 09:20:03 -08:00
|
|
|
}
|
2014-04-03 14:48:18 -07:00
|
|
|
EXPORT_SYMBOL(read_cache_page);
|
2010-01-27 09:20:03 -08:00
|
|
|
|
|
|
|
/**
|
|
|
|
* read_cache_page_gfp - read into page cache, using specified page allocation flags.
|
|
|
|
* @mapping: the page's address_space
|
|
|
|
* @index: the page index
|
|
|
|
* @gfp: the page allocator flags to use if allocating
|
|
|
|
*
|
|
|
|
* This is the same as "read_mapping_page(mapping, index, NULL)", but with
|
2011-12-21 11:05:48 -06:00
|
|
|
* any new page allocations done using the specified allocation flags.
|
2010-01-27 09:20:03 -08:00
|
|
|
*
|
|
|
|
* If the page does not get brought uptodate, return -EIO.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
2021-01-28 19:19:45 +01:00
|
|
|
* The function expects mapping->invalidate_lock to be already held.
|
|
|
|
*
|
2019-03-05 15:48:42 -08:00
|
|
|
* Return: up to date page on success, ERR_PTR() on failure.
|
2010-01-27 09:20:03 -08:00
|
|
|
*/
|
|
|
|
struct page *read_cache_page_gfp(struct address_space *mapping,
|
|
|
|
pgoff_t index,
|
|
|
|
gfp_t gfp)
|
|
|
|
{
|
2019-07-11 20:55:20 -07:00
|
|
|
return do_read_cache_page(mapping, index, NULL, NULL, gfp);
|
2010-01-27 09:20:03 -08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(read_cache_page_gfp);
|
|
|
|
|
2019-11-30 17:49:44 -08:00
|
|
|
/*
|
|
|
|
* Warn about a page cache invalidation failure during a direct I/O write.
|
|
|
|
*/
|
2023-06-01 16:58:58 +02:00
|
|
|
static void dio_warn_stale_pagecache(struct file *filp)
|
2019-11-30 17:49:44 -08:00
|
|
|
{
|
|
|
|
static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
|
|
|
|
char pathname[128];
|
|
|
|
char *path;
|
|
|
|
|
2020-11-16 14:33:37 +01:00
|
|
|
errseq_set(&filp->f_mapping->wb_err, -EIO);
|
2019-11-30 17:49:44 -08:00
|
|
|
if (__ratelimit(&_rs)) {
|
|
|
|
path = file_path(filp, pathname, sizeof(pathname));
|
|
|
|
if (IS_ERR(path))
|
|
|
|
path = "(unknown)";
|
|
|
|
pr_crit("Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!\n");
|
|
|
|
pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
|
|
|
|
current->comm);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-06-01 16:58:58 +02:00
|
|
|
void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2023-06-01 16:58:58 +02:00
|
|
|
struct address_space *mapping = iocb->ki_filp->f_mapping;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-06-01 16:58:58 +02:00
|
|
|
if (mapping->nrpages &&
|
|
|
|
invalidate_inode_pages2_range(mapping,
|
|
|
|
iocb->ki_pos >> PAGE_SHIFT,
|
|
|
|
(iocb->ki_pos + count - 1) >> PAGE_SHIFT))
|
|
|
|
dio_warn_stale_pagecache(iocb->ki_filp);
|
|
|
|
}
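The range handed to invalidate_inode_pages2_range() above is inclusive at both ends: the index of the page holding the first written byte and the index of the page holding the last one. A minimal userspace sketch of that calculation follows, assuming 4 KiB pages. SKETCH_PAGE_SHIFT, sketch_first_index() and sketch_last_index() are hypothetical names.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Index of the page containing byte offset pos. */
static inline uint64_t sketch_first_index(uint64_t pos)
{
	return pos >> SKETCH_PAGE_SHIFT;
}

/* Index of the page containing the last of count bytes at pos (count > 0). */
static inline uint64_t sketch_last_index(uint64_t pos, uint64_t count)
{
	return (pos + count - 1) >> SKETCH_PAGE_SHIFT;
}
```

The "- 1" matters at page boundaries: a 4096-byte write at offset 0 touches only page 0, while a 2-byte write at offset 4095 touches pages 0 and 1.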
|
2008-07-23 21:27:04 -07:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t
|
2016-04-07 08:51:56 -07:00
|
|
|
generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2023-06-01 16:58:58 +02:00
|
|
|
struct address_space *mapping = iocb->ki_filp->f_mapping;
|
|
|
|
size_t write_len = iov_iter_count(from);
|
|
|
|
ssize_t written;
|
2008-07-23 21:27:04 -07:00
|
|
|
|
fs: fix data invalidation in the cleancache during direct IO
Patch series "Properly invalidate data in the cleancache", v2.
We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache. The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.
Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well. So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.
- Patch 1 fixes direct IO writes by removing ->nrpages check.
- Patch 2 fixes similar case in invalidate_bdev().
Note: I only fixed conditional cleancache_invalidate_inode() here.
Do we also need to add a ->nrexceptional check into invalidate_bdev()?
- Patches 3-4: some optimizations.
This patch (of 4):
Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call. So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.
Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.
Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.
Note: nfs,cifs,9p don't need a similar fix because they never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.
Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 14:55:59 -07:00
|
|
|
/*
|
|
|
|
* If a page cannot be invalidated, return 0 to fall back
|
|
|
|
* to buffered write.
|
|
|
|
*/
|
2023-06-01 16:58:57 +02:00
|
|
|
written = kiocb_invalidate_pages(iocb, write_len);
|
2017-05-03 14:55:59 -07:00
|
|
|
if (written) {
|
|
|
|
if (written == -EBUSY)
|
|
|
|
return 0;
|
2023-06-01 16:58:58 +02:00
|
|
|
return written;
|
2008-07-23 21:27:04 -07:00
|
|
|
}
|
|
|
|
|
2017-04-13 14:10:15 -04:00
|
|
|
written = mapping->a_ops->direct_IO(iocb, from);
|
2008-07-23 21:27:04 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Finally, try again to invalidate clean pages which might have been
|
|
|
|
* cached by non-direct readahead, or faulted in by get_user_pages()
|
|
|
|
* if the source of the write was an mmap'ed region of the file
|
|
|
|
* we're writing. Either one is a pretty crazy thing to do,
|
|
|
|
* so we don't support it 100%. If this invalidation
|
|
|
|
* fails, tough, the write still worked...
|
2017-09-21 08:16:29 -06:00
|
|
|
*
|
|
|
|
* Most of the time we do not need this since dio_complete() will do
|
|
|
|
* the invalidation for us. However there are some file systems that
|
|
|
|
* do not end up with dio_complete() being called, so let's not break
|
2019-11-30 17:49:41 -08:00
|
|
|
* them by removing it completely.
|
|
|
|
*
|
2019-11-30 17:49:47 -08:00
|
|
|
* A notable example is blkdev_direct_IO().
|
|
|
|
*
|
2019-11-30 17:49:41 -08:00
|
|
|
* Skip invalidation for async writes or if mapping has no pages.
|
2008-07-23 21:27:04 -07:00
|
|
|
*/
|
2005-04-16 15:20:36 -07:00
|
|
|
if (written > 0) {
|
2023-06-01 16:58:58 +02:00
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
loff_t pos = iocb->ki_pos;
|
|
|
|
|
|
|
|
kiocb_invalidate_post_direct_write(iocb, written);
|
2010-10-26 14:21:58 -07:00
|
|
|
pos += written;
|
2017-04-13 14:10:15 -04:00
|
|
|
write_len -= written;
|
2010-10-26 14:21:58 -07:00
|
|
|
if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
|
|
|
|
i_size_write(inode, pos);
|
2005-04-16 15:20:36 -07:00
|
|
|
mark_inode_dirty(inode);
|
|
|
|
}
|
2014-02-11 20:58:20 -05:00
|
|
|
iocb->ki_pos = pos;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2021-02-24 12:01:45 -08:00
|
|
|
if (written != -EIOCBQUEUED)
|
|
|
|
iov_iter_revert(from, write_len - iov_iter_count(from));
|
2005-04-16 15:20:36 -07:00
|
|
|
return written;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(generic_file_direct_write);
|
|
|
|
|
2022-02-19 23:19:49 -05:00
|
|
|
ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
|
2007-10-16 01:25:01 -07:00
|
|
|
{
|
2022-02-19 23:19:49 -05:00
|
|
|
struct file *file = iocb->ki_filp;
|
|
|
|
loff_t pos = iocb->ki_pos;
|
2007-10-16 01:25:01 -07:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
|
|
|
const struct address_space_operations *a_ops = mapping->a_ops;
|
2024-05-27 18:36:08 +02:00
|
|
|
size_t chunk = mapping_max_folio_size(mapping);
|
2007-10-16 01:25:01 -07:00
|
|
|
long status = 0;
|
|
|
|
ssize_t written = 0;
|
2007-10-16 01:25:03 -07:00
|
|
|
|
2007-10-16 01:25:01 -07:00
|
|
|
do {
|
2024-05-27 18:36:08 +02:00
|
|
|
struct folio *folio;
|
|
|
|
size_t offset; /* Offset into folio */
|
|
|
|
size_t bytes; /* Bytes to write to folio */
|
2007-10-16 01:25:01 -07:00
|
|
|
size_t copied; /* Bytes copied from user */
|
2022-09-15 17:04:16 +02:00
|
|
|
void *fsdata = NULL;
|
2007-10-16 01:25:01 -07:00
|
|
|
|
2024-05-27 18:36:08 +02:00
|
|
|
bytes = iov_iter_count(i);
|
|
|
|
retry:
|
|
|
|
offset = pos & (chunk - 1);
|
|
|
|
bytes = min(chunk - offset, bytes);
|
|
|
|
balance_dirty_pages_ratelimited(mapping);
|
2007-10-16 01:25:01 -07:00
|
|
|
|
2015-10-07 08:32:38 +01:00
|
|
|
/*
|
|
|
|
* Bring in the user page that we will copy from _first_.
|
|
|
|
* Otherwise there's a nasty deadlock on copying from the
|
|
|
|
* same page as we're writing to, without it being marked
|
|
|
|
* up-to-date.
|
|
|
|
*/
|
2021-11-09 12:56:06 +01:00
|
|
|
if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
|
2015-10-07 08:32:38 +01:00
|
|
|
status = -EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
mm: make sendfile(2) killable
Currently a simple program below issues a sendfile(2) system call which
takes about 62 days to complete in my test KVM instance.
int fd;
off_t off = 0;
fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644);
ftruncate(fd, 2);
lseek(fd, 0, SEEK_END);
sendfile(fd, fd, &off, 0xfffffff);
Now you should not ask the kernel to do stupid stuff like copying 256MB in
2-byte chunks and calling fsync(2) after each chunk, but if you do, the
sysadmin should have a way to stop you.
We actually do have a check for fatal_signal_pending() in
generic_perform_write() which triggers in this path however because we
always succeed in writing something before the check is done, we return
value > 0 from generic_perform_write() and thus the information about
signal gets lost.
Fix the problem by doing the signal check before writing anything. That
way generic_perform_write() returns -EINTR, the error gets propagated up
and the sendfile loop terminates early.
Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
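The ordering argument in the message above can be sketched in user space. This is a toy model only, not kernel code: `model_perform_write()`, its parameters and the `check_first` switch are hypothetical names introduced for illustration, and the copy step is replaced by simple arithmetic. It assumes the same return convention as the real loop, where any progress is reported in preference to the error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the write loop. "signal_pending" stands in for
 * fatal_signal_pending(current); "check_first" selects the fixed ordering.
 */
static long model_perform_write(size_t len, size_t chunk,
				bool signal_pending, bool check_first)
{
	long written = 0;
	long status = 0;

	assert(chunk > 0);	/* a zero chunk would never make progress */

	while (len > 0) {
		size_t bytes = len < chunk ? len : chunk;

		/* Fixed ordering: bail out before any data is copied. */
		if (check_first && signal_pending) {
			status = -EINTR;
			break;
		}

		/* Stand-in for write_begin/copy/write_end succeeding. */
		written += (long)bytes;
		len -= bytes;

		/* Old ordering: the check ran only after a chunk was written. */
		if (!check_first && signal_pending) {
			status = -EINTR;
			break;
		}
	}

	/* Progress wins over the error, so a late check hides -EINTR. */
	return written ? written : status;
}
```

With the old ordering a 2-byte request with a pending signal still reports 2 bytes written, so a caller looping over chunks never sees the signal; with the check moved first the same request returns -EINTR.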
2015-10-22 13:32:21 -07:00
|
|
|
if (fatal_signal_pending(current)) {
|
|
|
|
status = -EINTR;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2022-02-22 14:31:43 -05:00
|
|
|
status = a_ops->write_begin(file, mapping, pos, bytes,
|
2024-07-15 14:24:01 -04:00
|
|
|
&folio, &fsdata);
|
2014-06-04 16:10:31 -07:00
|
|
|
if (unlikely(status < 0))
|
2007-10-16 01:25:01 -07:00
|
|
|
break;
|
|
|
|
|
2024-05-27 18:36:08 +02:00
|
|
|
offset = offset_in_folio(folio, pos);
|
|
|
|
if (bytes > folio_size(folio) - offset)
|
|
|
|
bytes = folio_size(folio) - offset;
|
|
|
|
|
mm: flush dcache before writing into page to avoid alias
The cache alias problem happens if changes made through a user shared
mapping are not flushed before copying: the user and kernel mappings may
land in two different cache lines, so coherence cannot be guaranteed
after iov_iter_copy_from_user_atomic. So the right steps should be:
flush_dcache_page(page);
kmap_atomic(page);
write to page;
kunmap_atomic(page);
flush_dcache_page(page);
More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.
Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:
int val = 0x11111111;
fd = open("abc", O_RDWR);
addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
*(addr+0) = 0x44444444;
tmp = *(addr+0);
*(addr+1) = 0x77777777;
write(fd, &val, sizeof(int));
close(fd);
The results are not always 0x11111111 0x77777777 at the beginning as expected. Sometimes we see 0x44444444 0x77777777.
Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 13:44:02 -08:00
|
|
|
if (mapping_writably_mapped(mapping))
|
2024-05-27 18:36:08 +02:00
|
|
|
flush_dcache_folio(folio);
|
2015-10-07 08:32:38 +01:00
|
|
|
|
2024-05-27 18:36:08 +02:00
|
|
|
copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
|
|
|
|
flush_dcache_folio(folio);
|
2007-10-16 01:25:01 -07:00
|
|
|
|
|
|
|
status = a_ops->write_end(file, mapping, pos, bytes, copied,
|
2024-07-10 15:45:32 -04:00
|
|
|
folio, fsdata);
|
2021-04-30 10:26:41 -04:00
|
|
|
if (unlikely(status != copied)) {
|
|
|
|
iov_iter_revert(i, copied - max(status, 0L));
|
|
|
|
if (unlikely(status < 0))
|
|
|
|
break;
|
|
|
|
}
|
2007-10-16 01:25:01 -07:00
|
|
|
cond_resched();
|
|
|
|
|
2021-05-31 00:32:44 -04:00
|
|
|
if (unlikely(status == 0)) {
|
2007-10-16 01:25:01 -07:00
|
|
|
/*
|
2021-05-31 00:32:44 -04:00
|
|
|
* A short copy made ->write_end() reject the
|
|
|
|
* thing entirely. Might be memory poisoning
|
|
|
|
* halfway through, might be a race with munmap,
|
|
|
|
* might be severe memory pressure.
|
2007-10-16 01:25:01 -07:00
|
|
|
*/
|
2024-05-27 18:36:08 +02:00
|
|
|
if (chunk > PAGE_SIZE)
|
|
|
|
chunk /= 2;
|
|
|
|
if (copied) {
|
2021-05-31 00:32:44 -04:00
|
|
|
bytes = copied;
|
2024-05-27 18:36:08 +02:00
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
pos += status;
|
|
|
|
written += status;
|
2007-10-16 01:25:01 -07:00
|
|
|
}
|
|
|
|
} while (iov_iter_count(i));
|
|
|
|
|
2023-06-01 16:58:55 +02:00
|
|
|
if (!written)
|
|
|
|
return status;
|
|
|
|
iocb->ki_pos += written;
|
|
|
|
return written;
|
2007-10-16 01:25:01 -07:00
|
|
|
}
|
2014-02-11 21:34:08 -05:00
|
|
|
EXPORT_SYMBOL(generic_perform_write);
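The per-iteration sizing in generic_perform_write() above can be tried out in isolation. The sketch below is a user-space model under stated assumptions: `clamp_to_chunk()`, `shrink_chunk()` and `struct write_span` are hypothetical names, and the mask trick assumes the chunk size is a power of two, which the `pos & (chunk - 1)` expression in the loop appears to rely on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: where in the chunk to start, and how much to copy. */
struct write_span {
	size_t offset;	/* offset of pos within its chunk */
	size_t bytes;	/* bytes to copy in this iteration */
};

/* Mirrors "offset = pos & (chunk - 1); bytes = min(chunk - offset, bytes);". */
static struct write_span clamp_to_chunk(unsigned long long pos, size_t chunk,
					size_t remaining)
{
	struct write_span span;
	size_t room;

	/* Assumption: chunk is a non-zero power of two, so masking works. */
	assert(chunk != 0 && (chunk & (chunk - 1)) == 0);

	span.offset = (size_t)(pos & (chunk - 1));
	room = chunk - span.offset;
	span.bytes = remaining < room ? remaining : room;
	return span;
}

/* Mirrors the back-off after a rejected short copy: halve, but not below a page. */
static size_t shrink_chunk(size_t chunk, size_t page_size)
{
	return chunk > page_size ? chunk / 2 : chunk;
}
```

For example, a write at position 5000 with a 4096-byte chunk starts 904 bytes into the chunk and may copy at most 3192 bytes before the next iteration.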
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2009-08-17 18:10:06 +02:00
|
|
|
/**
|
2014-04-03 03:17:43 -04:00
|
|
|
* __generic_file_write_iter - write data to a file
|
2009-08-17 18:10:06 +02:00
|
|
|
* @iocb: IO state structure (file, offset, etc.)
|
2014-04-03 03:17:43 -04:00
|
|
|
* @from: iov_iter with data to write
|
2009-08-17 18:10:06 +02:00
|
|
|
*
|
|
|
|
* This function does all the work needed for actually writing data to a
|
|
|
|
* file. It does all basic checks, removes SUID from the file, updates
|
|
|
|
* modification times and calls proper subroutines depending on whether we
|
|
|
|
* do direct IO or a standard buffered write.
|
|
|
|
*
|
2021-04-12 15:50:21 +02:00
|
|
|
* It expects i_rwsem to be grabbed unless we work on a block device or similar
|
2009-08-17 18:10:06 +02:00
|
|
|
* object which does not need locking at all.
|
|
|
|
*
|
|
|
|
* This function does *not* take care of syncing data in case of O_SYNC write.
|
|
|
|
* A caller has to handle it. This is mainly due to the fact that we want to
|
2021-04-12 15:50:21 +02:00
|
|
|
* avoid syncing under i_rwsem.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* * number of bytes written, even for truncated writes
|
|
|
|
* * negative error code if no data has been written at all
|
2009-08-17 18:10:06 +02:00
|
|
|
*/
|
2014-04-03 03:17:43 -04:00
|
|
|
ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
struct file *file = iocb->ki_filp;
|
2021-05-04 18:40:12 -07:00
|
|
|
struct address_space *mapping = file->f_mapping;
|
2023-06-01 16:59:01 +02:00
|
|
|
struct inode *inode = mapping->host;
|
|
|
|
ssize_t ret;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-06-01 16:59:01 +02:00
|
|
|
ret = file_remove_privs(file);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2023-06-01 16:59:01 +02:00
|
|
|
ret = file_update_time(file);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2006-10-19 23:28:13 -07:00
|
|
|
|
2015-04-09 13:52:01 -04:00
|
|
|
if (iocb->ki_flags & IOCB_DIRECT) {
|
2023-06-01 16:59:01 +02:00
|
|
|
ret = generic_file_direct_write(iocb, from);
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
2015-02-16 15:58:53 -08:00
|
|
|
* If the write stopped short of completing, fall back to
|
|
|
|
* buffered writes. Some filesystems do this for writes to
|
|
|
|
* holes, for example. For DAX files, a buffered write will
|
|
|
|
* not succeed (even if it did, DAX does not handle dirty
|
|
|
|
* page-cache pages correctly).
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2023-06-01 16:59:01 +02:00
|
|
|
if (ret < 0 || !iov_iter_count(from) || IS_DAX(inode))
|
|
|
|
return ret;
|
|
|
|
return direct_write_fallback(iocb, from, ret,
|
|
|
|
generic_perform_write(iocb, from));
|
2006-10-19 23:28:13 -07:00
|
|
|
}
|
2023-06-01 16:59:01 +02:00
|
|
|
|
|
|
|
return generic_perform_write(iocb, from);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2014-04-03 03:17:43 -04:00
|
|
|
EXPORT_SYMBOL(__generic_file_write_iter);
|
2009-08-17 18:10:06 +02:00
|
|
|
|
|
|
|
/**
|
2014-04-03 03:17:43 -04:00
|
|
|
* generic_file_write_iter - write data to a file
|
2009-08-17 18:10:06 +02:00
|
|
|
* @iocb: IO state structure
|
2014-04-03 03:17:43 -04:00
|
|
|
* @from: iov_iter with data to write
|
2009-08-17 18:10:06 +02:00
|
|
|
*
|
2014-04-03 03:17:43 -04:00
|
|
|
* This is a wrapper around __generic_file_write_iter() to be used by most
|
2009-08-17 18:10:06 +02:00
|
|
|
* filesystems. It takes care of syncing the file in case of O_SYNC file
|
2021-04-12 15:50:21 +02:00
|
|
|
* and acquires i_rwsem as needed.
|
2019-03-05 15:48:42 -08:00
|
|
|
* Return:
|
|
|
|
* * negative error code if no data has been written at all or
|
|
|
|
* vfs_fsync_range() failed for a synchronous write
|
|
|
|
* * number of bytes written, even for truncated writes
|
2009-08-17 18:10:06 +02:00
|
|
|
*/
|
2014-04-03 03:17:43 -04:00
|
|
|
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
|
|
|
struct file *file = iocb->ki_filp;
|
2009-08-17 19:52:36 +02:00
|
|
|
struct inode *inode = file->f_mapping->host;
|
2005-04-16 15:20:36 -07:00
|
|
|
ssize_t ret;
|
|
|
|
|
2016-01-22 15:40:57 -05:00
|
|
|
inode_lock(inode);
|
2015-04-09 12:55:47 -04:00
|
|
|
ret = generic_write_checks(iocb, from);
|
|
|
|
if (ret > 0)
|
2015-04-07 11:28:12 -04:00
|
|
|
ret = __generic_file_write_iter(iocb, from);
|
2016-01-22 15:40:57 -05:00
|
|
|
inode_unlock(inode);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2016-04-07 08:52:01 -07:00
|
|
|
if (ret > 0)
|
|
|
|
ret = generic_write_sync(iocb, ret);
|
2005-04-16 15:20:36 -07:00
|
|
|
return ret;
|
|
|
|
}
|
2014-04-03 03:17:43 -04:00
|
|
|
EXPORT_SYMBOL(generic_file_write_iter);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2006-08-29 19:05:54 +01:00
|
|
|
/**
|
2021-07-28 15:14:48 -04:00
|
|
|
* filemap_release_folio() - Release fs-specific metadata on a folio.
|
|
|
|
* @folio: The folio which the kernel is trying to free.
|
|
|
|
* @gfp: Memory allocation flags (and I/O mode).
|
2006-08-29 19:05:54 +01:00
|
|
|
*
|
2021-07-28 15:14:48 -04:00
|
|
|
* The address_space is trying to release any data attached to a folio
|
|
|
|
* (presumably at folio->private).
|
2006-08-29 19:05:54 +01:00
|
|
|
*
|
2021-07-28 15:14:48 -04:00
|
|
|
* This will also be called if the private_2 flag is set on a folio,
|
|
|
|
* indicating that the folio has other metadata associated with it.
|
2009-04-03 16:42:36 +01:00
|
|
|
*
|
2021-07-28 15:14:48 -04:00
|
|
|
* The @gfp argument specifies whether I/O may be performed to release
|
|
|
|
* this page (__GFP_IO), and whether the call may block
|
|
|
|
* (__GFP_RECLAIM & __GFP_FS).
|
2006-08-29 19:05:54 +01:00
|
|
|
*
|
2021-07-28 15:14:48 -04:00
|
|
|
* Return: %true if the release was successful, otherwise %false.
|
2006-08-29 19:05:54 +01:00
|
|
|
*/
|
2021-07-28 15:14:48 -04:00
|
|
|
bool filemap_release_folio(struct folio *folio, gfp_t gfp)
|
2006-08-29 19:05:54 +01:00
|
|
|
{
|
2021-07-28 15:14:48 -04:00
|
|
|
struct address_space * const mapping = folio->mapping;
|
2006-08-29 19:05:54 +01:00
|
|
|
|
2021-07-28 15:14:48 -04:00
|
|
|
BUG_ON(!folio_test_locked(folio));
|
mm: merge folio_has_private()/filemap_release_folio() call pairs
Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.
This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache. The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.
The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.
The second patch is the actual fix. Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set. folio_needs_release() is altered to
add a check for this.
This patch (of 2):
Make filemap_release_folio() check folio_has_private(). Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.
There are a couple of sites in mm/vmscan.c where this can't so easily be
done. In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.
In shrink_active_list(), we don't have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.
A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory. This will acquire
additional parts to the condition in a future patch.
After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.
Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-28 11:48:51 +01:00
|
|
|
if (!folio_needs_release(folio))
|
|
|
|
return true;
|
2021-07-28 15:14:48 -04:00
|
|
|
if (folio_test_writeback(folio))
|
|
|
|
return false;
|
2006-08-29 19:05:54 +01:00
|
|
|
|
2022-04-29 17:00:05 -04:00
|
|
|
if (mapping && mapping->a_ops->release_folio)
|
|
|
|
return mapping->a_ops->release_folio(folio, gfp);
|
2022-05-01 01:08:08 -04:00
|
|
|
return try_to_free_buffers(folio);
|
2006-08-29 19:05:54 +01:00
|
|
|
}
|
2021-07-28 15:14:48 -04:00
|
|
|
EXPORT_SYMBOL(filemap_release_folio);
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. User should pass 0 (i.e no flag specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_range points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
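A user-space caller for the interface described above might look roughly like the following sketch. Several things here are assumptions rather than facts from the message: the wrapper name `query_cachestat()`, the locally defined `struct cs_range` and `struct cs_result` (which only mirror the layout shown in the SYNOPSIS), and the fallback syscall number 451, which should be checked against the installed kernel headers. The call goes through the generic syscall(2) wrapper because a libc wrapper may not be available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumption: verify against your kernel headers */
#endif

/* Local mirrors of the structures in the SYNOPSIS (hypothetical names). */
struct cs_range {
	uint64_t off;
	uint64_t len;	/* 0 means "from off to the end of the file" */
};

struct cs_result {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/* Returns 0 on success, or -1 with errno set (for example ENOSYS on old kernels). */
static int query_cachestat(int fd, uint64_t off, uint64_t len,
			   struct cs_result *out)
{
	struct cs_range range = { .off = off, .len = len };

	assert(out != NULL);
	memset(out, 0, sizeof(*out));

	/* flags must be 0 for now, per the DESCRIPTION above. */
	return (int)syscall(__NR_cachestat, fd, &range, out, 0u);
}
```

A caller would open a file, pass its descriptor with `off = 0, len = 0` to cover the whole file, and treat ENOSYS as "not supported on this kernel" rather than as a hard failure.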
2023-05-02 18:36:07 -07:00
|
|
|
|
mm: Provide a means of invalidation without using launder_folio
Implement a replacement for launder_folio. The key feature of
invalidate_inode_pages2() is that it locks each folio individually, unmaps
it to prevent mmap'd accesses interfering and calls the ->launder_folio()
address_space op to flush it. This has problems: firstly, each folio is
written individually as one or more small writes; secondly, adjacent folios
cannot be added so easily into the laundry; thirdly, it's yet another op to
implement.
Instead, use the invalidate lock to cause anyone wanting to add a folio to
the inode to wait, then unmap all the folios if we have mmaps, then,
conditionally, use ->writepages() to flush any dirty data back and then
discard all pages.
The invalidate lock prevents ->read_iter(), ->write_iter() and faulting
through mmap all from adding pages for the duration.
This is then used from netfslib to handle the flushing in unbuffered and
direct writes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: devel@lists.orangefs.org
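The byte-range to page-range conversion that the new helper performs before unmapping can be modelled on its own. This is a sketch under assumptions: `byte_range_to_pages()`, `struct page_range` and `MODEL_PAGE_SHIFT` are hypothetical names, and a 4 KiB page size is assumed purely for illustration.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical result type: first page index and number of pages covered. */
struct page_range {
	unsigned long first;
	unsigned long nr;
};

/*
 * Convert an inclusive byte range [start, end] into page units.
 * end == LLONG_MAX means "everything from start onwards".
 */
static struct page_range byte_range_to_pages(long long start, long long end)
{
	struct page_range r;
	unsigned long last;

	assert(start >= 0 && end >= start);

	r.first = (unsigned long)(start >> MODEL_PAGE_SHIFT);
	last = (unsigned long)(end >> MODEL_PAGE_SHIFT);
	r.nr = end == LLONG_MAX ? ULONG_MAX : last - r.first + 1;
	return r;
}
```

Because the end offset is inclusive, the two-byte range [4095, 4096] straddles a page boundary and covers two pages, while [0, 4095] covers exactly one.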
2024-03-27 08:51:38 +00:00
|
|
|
/**
|
|
|
|
* filemap_invalidate_inode - Invalidate/forcibly write back a range of an inode's pagecache
|
|
|
|
* @inode: The inode to flush
|
|
|
|
* @flush: Set to write back rather than simply invalidate.
|
|
|
|
* @start: First byte in range.
|
|
|
|
* @end: Last byte in range (inclusive), or LLONG_MAX for everything from start
|
|
|
|
* onwards.
|
|
|
|
*
|
|
|
|
* Invalidate all the folios on an inode that contribute to the specified
|
|
|
|
* range, possibly writing them back first. Whilst the operation is
|
|
|
|
* undertaken, the invalidate lock is held to prevent new folios from being
|
|
|
|
* installed.
|
|
|
|
*/
|
|
|
|
int filemap_invalidate_inode(struct inode *inode, bool flush,
|
|
|
|
loff_t start, loff_t end)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = inode->i_mapping;
|
|
|
|
pgoff_t first = start >> PAGE_SHIFT;
|
|
|
|
pgoff_t last = end >> PAGE_SHIFT;
|
|
|
|
pgoff_t nr = end == LLONG_MAX ? ULONG_MAX : last - first + 1;
|
|
|
|
|
|
|
|
if (!mapping || !mapping->nrpages || end < start)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Prevent new folios from being added to the inode. */
|
|
|
|
filemap_invalidate_lock(mapping);
|
|
|
|
|
|
|
|
if (!mapping->nrpages)
|
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
unmap_mapping_pages(mapping, first, nr, false);
|
|
|
|
|
|
|
|
/* Write back the data if we're asked to. */
|
|
|
|
if (flush) {
|
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = WB_SYNC_ALL,
|
|
|
|
.nr_to_write = LONG_MAX,
|
|
|
|
.range_start = start,
|
|
|
|
.range_end = end,
|
|
|
|
};
|
|
|
|
|
|
|
|
filemap_fdatawrite_wbc(mapping, &wbc);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Wait for writeback to complete on all folios and discard. */
|
2024-08-28 22:02:45 +01:00
|
|
|
invalidate_inode_pages2_range(mapping, start / PAGE_SIZE, end / PAGE_SIZE);
|
2024-03-27 08:51:38 +00:00
|
|
|
|
|
|
|
unlock:
|
|
|
|
filemap_invalidate_unlock(mapping);
|
|
|
|
out:
|
|
|
|
return filemap_check_errors(mapping);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(filemap_invalidate_inode);
|
|
|
|
|
2023-05-02 18:36:07 -07:00
|
|
|
#ifdef CONFIG_CACHESTAT_SYSCALL
|
|
|
|
/**
|
|
|
|
* filemap_cachestat() - compute the page cache statistics of a mapping
|
|
|
|
* @mapping: The mapping to compute the statistics for.
|
|
|
|
* @first_index: The starting page cache index.
|
|
|
|
* @last_index: The final page index (inclusive).
|
|
|
|
* @cs: the cachestat struct to write the result to.
|
|
|
|
*
|
|
|
|
* This will query the page cache statistics of a mapping in the
|
|
|
|
* page range of [first_index, last_index] (inclusive). The statistics
|
|
|
|
* queried include: number of dirty pages, number of pages marked for
|
|
|
|
* writeback, and the number of (recently) evicted pages.
|
|
|
|
*/
|
|
|
|
static void filemap_cachestat(struct address_space *mapping,
|
|
|
|
pgoff_t first_index, pgoff_t last_index, struct cachestat *cs)
|
|
|
|
{
|
|
|
|
XA_STATE(xas, &mapping->i_pages, first_index);
|
|
|
|
struct folio *folio;
|
|
|
|
|
2024-06-27 13:17:37 -07:00
|
|
|
/* Flush stats (and potentially sleep) outside the RCU read section. */
|
|
|
|
mem_cgroup_flush_stats_ratelimited(NULL);
|
|
|
|
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for the x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, and number of recently evicted pages in the byte range given by
`off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. Users should pass 0 (i.e., no flags specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_range points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
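The interface described above can be exercised from userspace roughly as follows. This is a minimal sketch, not part of the patch: the wrapper name query_cachestat and the local struct names cs_range and cs_counts are hypothetical, the structures simply mirror the SYNOPSIS, and the fallback syscall number 451 is an assumption for systems whose headers do not yet define __NR_cachestat.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451 /* assumption: number used on x86 and asm-generic */
#endif

/* Hypothetical local mirror of struct cachestat_range from the SYNOPSIS. */
struct cs_range {
	uint64_t off; /* start of the queried byte range */
	uint64_t len; /* length in bytes; 0 means "to end of file" */
};

/* Hypothetical local mirror of struct cachestat from the SYNOPSIS. */
struct cs_counts {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Query page cache statistics for [off, off + len) of fd through the raw
 * syscall. Returns 0 on success, or -1 with errno set (for example ENOSYS
 * on kernels that do not implement cachestat).
 */
static int query_cachestat(int fd, uint64_t off, uint64_t len,
			   struct cs_counts *out)
{
	struct cs_range range = { .off = off, .len = len };

	assert(out != NULL);
	/* flags must be 0: no flags are defined yet */
	return (int)syscall(__NR_cachestat, fd, &range, out, 0u);
}
```

A caller would open a file, call query_cachestat(fd, 0, 0, &counts) to cover the whole file, and treat the counters as a snapshot that may already be stale. Local struct definitions are used here only so the sketch does not depend on the installed kernel headers; newer UAPI headers are expected to provide the real definitions.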
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 18:36:07 -07:00
|
|
|
rcu_read_lock();
|
|
|
|
xas_for_each(&xas, folio, last_index) {
|
2024-02-19 19:01:21 -08:00
|
|
|
int order;
|
2023-05-02 18:36:07 -07:00
|
|
|
unsigned long nr_pages;
|
|
|
|
pgoff_t folio_first_index, folio_last_index;
|
|
|
|
|
2024-02-19 19:01:21 -08:00
|
|
|
/*
|
|
|
|
* Don't deref the folio. It is not pinned, and might
|
|
|
|
* get freed (and reused) underneath us.
|
|
|
|
*
|
|
|
|
* We *could* pin it, but that would be expensive for
|
|
|
|
* what should be a fast and lightweight syscall.
|
|
|
|
*
|
|
|
|
* Instead, derive all information of interest from
|
|
|
|
* the rcu-protected xarray.
|
|
|
|
*/
|
|
|
|
|
2023-05-02 18:36:07 -07:00
|
|
|
if (xas_retry(&xas, folio))
|
|
|
|
continue;
|
|
|
|
|
2024-09-06 16:05:12 -07:00
|
|
|
order = xas_get_order(&xas);
|
2024-02-19 19:01:21 -08:00
|
|
|
nr_pages = 1 << order;
|
|
|
|
folio_first_index = round_down(xas.xa_index, 1 << order);
|
|
|
|
folio_last_index = folio_first_index + nr_pages - 1;
|
|
|
|
|
|
|
|
/* Folios might straddle the range boundaries, only count covered pages */
|
|
|
|
if (folio_first_index < first_index)
|
|
|
|
nr_pages -= first_index - folio_first_index;
|
|
|
|
|
|
|
|
if (folio_last_index > last_index)
|
|
|
|
nr_pages -= folio_last_index - last_index;
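The boundary clipping above can be checked in isolation with ordinary integers. The following is a userspace sketch under stated assumptions: covered_pages is a hypothetical helper, plain unsigned long stands in for pgoff_t, and a mask replaces the kernel's round_down() helper. It assumes the folio overlaps the queried range, which the xarray walk guarantees in the kernel code.

```c
#include <assert.h>

/*
 * Hypothetical helper: given any page index inside a folio of 2^order
 * pages, return how many of the folio's pages fall inside the inclusive
 * index range [first_index, last_index].
 */
static unsigned long covered_pages(unsigned long index, unsigned int order,
				   unsigned long first_index,
				   unsigned long last_index)
{
	unsigned long nr_pages = 1UL << order;
	/* Folios are naturally aligned, so masking rounds down to the start. */
	unsigned long folio_first = index & ~(nr_pages - 1);
	unsigned long folio_last = folio_first + nr_pages - 1;

	assert(first_index <= last_index);
	/* The folio must overlap the range for the subtraction to be safe. */
	assert(folio_last >= first_index && folio_first <= last_index);

	if (folio_first < first_index)
		nr_pages -= first_index - folio_first; /* pages before the range */
	if (folio_last > last_index)
		nr_pages -= folio_last - last_index;   /* pages after the range */

	return nr_pages;
}
```

For example, an order-2 folio covering indices 4 to 7 contributes only 2 pages to a query over indices 5 to 6, because one page is clipped at each end.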
|
|
|
|
|
2023-05-02 18:36:07 -07:00
|
|
|
if (xa_is_value(folio)) {
|
|
|
|
/* page is evicted */
|
|
|
|
void *shadow = (void *)folio;
|
|
|
|
bool workingset; /* not used */
|
|
|
|
|
|
|
|
cs->nr_evicted += nr_pages;
|
|
|
|
|
|
|
|
#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
|
|
|
|
if (shmem_mapping(mapping)) {
|
|
|
|
/* shmem file - in swap cache */
|
|
|
|
swp_entry_t swp = radix_to_swp_entry(folio);
|
|
|
|
|
2024-03-15 05:55:56 -04:00
|
|
|
/* swapin error results in poisoned entry */
|
|
|
|
if (non_swap_entry(swp))
|
|
|
|
goto resched;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Getting a swap entry from the shmem
|
|
|
|
* inode means we beat
|
|
|
|
* shmem_unuse(). rcu_read_lock()
|
|
|
|
* ensures swapoff waits for us before
|
|
|
|
* freeing the swapper space. However,
|
|
|
|
* we can race with swapping and
|
|
|
|
* invalidation, so there might not be
|
|
|
|
* a shadow in the swapcache (yet).
|
|
|
|
*/
|
2023-05-02 18:36:07 -07:00
|
|
|
shadow = get_shadow_from_swap_cache(swp);
|
2024-03-15 05:55:56 -04:00
|
|
|
if (!shadow)
|
|
|
|
goto resched;
|
2023-05-02 18:36:07 -07:00
|
|
|
}
|
|
|
|
#endif
|
2024-06-27 13:17:37 -07:00
|
|
|
if (workingset_test_recent(shadow, true, &workingset, false))
|
2023-05-02 18:36:07 -07:00
|
|
|
cs->nr_recently_evicted += nr_pages;
|
|
|
|
|
|
|
|
goto resched;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* page is in cache */
|
|
|
|
cs->nr_cache += nr_pages;
|
|
|
|
|
2024-02-19 19:01:21 -08:00
|
|
|
if (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY))
|
2023-05-02 18:36:07 -07:00
|
|
|
cs->nr_dirty += nr_pages;
|
|
|
|
|
2024-02-19 19:01:21 -08:00
|
|
|
if (xas_get_mark(&xas, PAGECACHE_TAG_WRITEBACK))
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really doesn not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing a database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases can be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for the x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, and number of recently evicted pages in the byte range given
by `off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`). If `len` ==
0, the query covers the range from `off` to the end of the file.
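As a rough illustration of how a byte range maps onto page indices, the sketch below mirrors the `first_index`/`last_index` computation in the syscall body in this file. The names `SKETCH_PAGE_SHIFT`, `struct page_span` and `byte_range_to_pages` are hypothetical, and 4 KiB pages are an assumption; the real page size depends on the architecture and kernel configuration.

```c
#include <limits.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages (shift of 12). */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper type: inclusive range of page indices. */
struct page_span {
	unsigned long first;
	unsigned long last;
};

/*
 * Convert a byte range into the page indices it touches.
 * len == 0 means "from off to the end of the file", which is
 * expressed here as an open-ended last index of ULONG_MAX.
 */
static struct page_span byte_range_to_pages(uint64_t off, uint64_t len)
{
	struct page_span span;

	span.first = (unsigned long)(off >> SKETCH_PAGE_SHIFT);
	/* The last page is the one holding byte off + len - 1. */
	span.last = len == 0 ?
		ULONG_MAX :
		(unsigned long)((off + len - 1) >> SKETCH_PAGE_SHIFT);
	return span;
}
```

Under these assumptions, a query with `off` = 4095 and `len` = 2 touches pages 0 and 1, because the two bytes straddle a page boundary.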
The `flags` argument is unused for now, but is included for future
extensibility. Users should pass 0 (i.e., no flags specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_range points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
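A minimal userspace sketch of invoking the call through syscall(2) might look like the following. It assumes the C library provides no wrapper, and that the syscall number is 451 when the headers do not define `__NR_cachestat` (this holds for x86-64 and the generic syscall table, but not for every architecture). The `_sketch` struct and function names are hypothetical stand-ins for the definitions in the SYNOPSIS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 451 is the number on x86-64 and asm-generic tables. */
#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

/* Hypothetical local copies of the structures from the SYNOPSIS. */
struct cachestat_range_sketch {
	uint64_t off;
	uint64_t len;
};

struct cachestat_sketch {
	uint64_t nr_cache;
	uint64_t nr_dirty;
	uint64_t nr_writeback;
	uint64_t nr_evicted;
	uint64_t nr_recently_evicted;
};

/*
 * Query page cache statistics for [off, off + len) of fd.
 * Returns 0 on success, or -1 with errno set (for example ENOSYS
 * on kernels that do not implement the call).
 */
static int cachestat_sketch_call(int fd, uint64_t off, uint64_t len,
				 struct cachestat_sketch *out)
{
	struct cachestat_range_sketch range = { .off = off, .len = len };

	/* flags must be 0; any other value is rejected with EINVAL. */
	return (int)syscall(__NR_cachestat, fd, &range, out, 0u);
}
```

A caller would typically open the file, call `cachestat_sketch_call(fd, 0, 0, &cs)` to cover the whole file, and treat the counters as a snapshot that may already be stale.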
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 18:36:07 -07:00
|
|
|
cs->nr_writeback += nr_pages;
|
|
|
|
|
|
|
|
resched:
|
|
|
|
if (need_resched()) {
|
|
|
|
xas_pause(&xas);
|
|
|
|
cond_resched_rcu();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The cachestat(2) system call.
|
|
|
|
*
|
|
|
|
* cachestat() returns the page cache statistics of a file in the
|
|
|
|
* byte range specified by `off` and `len`: number of cached pages,
|
|
|
|
* number of dirty pages, number of pages marked for writeback,
|
|
|
|
* number of evicted pages, and number of recently evicted pages.
|
|
|
|
*
|
|
|
|
* An evicted page is a page that was previously in the page cache
|
|
|
|
* but has been evicted since. A page is recently evicted if its last
|
|
|
|
* eviction was recent enough that its reentry to the cache would
|
|
|
|
* indicate that it is actively being used by the system, and that
|
|
|
|
* there is memory pressure on the system.
|
|
|
|
*
|
|
|
|
* `off` and `len` must be non-negative integers. If `len` > 0,
|
|
|
|
* the queried range is [`off`, `off` + `len`). If `len` == 0,
|
|
|
|
* we will query in the range from `off` to the end of the file.
|
|
|
|
*
|
|
|
|
* The `flags` argument is unused for now, but is included for future
|
|
|
|
* extensibility. Users should pass 0 (i.e., no flags specified).
|
|
|
|
*
|
|
|
|
* Currently, hugetlbfs is not supported.
|
|
|
|
*
|
|
|
|
* Because the status of a page can change after cachestat() checks it
|
|
|
|
* but before it returns to the application, the returned values may
|
|
|
|
* contain stale information.
|
|
|
|
*
|
|
|
|
* return values:
|
|
|
|
* zero - success
|
|
|
|
* -EFAULT - cstat or cstat_range points to an illegal address
|
|
|
|
* -EINVAL - invalid flags
|
|
|
|
* -EBADF - invalid file descriptor
|
|
|
|
* -EOPNOTSUPP - file descriptor is of a hugetlbfs file
|
|
|
|
*/
|
|
|
|
SYSCALL_DEFINE4(cachestat, unsigned int, fd,
|
|
|
|
struct cachestat_range __user *, cstat_range,
|
|
|
|
struct cachestat __user *, cstat, unsigned int, flags)
|
|
|
|
{
|
|
|
|
struct fd f = fdget(fd);
|
|
|
|
struct address_space *mapping;
|
|
|
|
struct cachestat_range csr;
|
|
|
|
struct cachestat cs;
|
|
|
|
pgoff_t first_index, last_index;
|
|
|
|
|
2024-05-31 14:12:01 -04:00
|
|
|
if (!fd_file(f))
|
2023-05-02 18:36:07 -07:00
|
|
|
return -EBADF;
|
|
|
|
|
|
|
|
if (copy_from_user(&csr, cstat_range,
|
|
|
|
sizeof(struct cachestat_range))) {
|
|
|
|
fdput(f);
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* hugetlbfs is not supported */
|
2024-05-31 14:12:01 -04:00
|
|
|
if (is_file_hugepages(fd_file(f))) {
|
2023-05-02 18:36:07 -07:00
|
|
|
fdput(f);
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (flags != 0) {
|
|
|
|
fdput(f);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
first_index = csr.off >> PAGE_SHIFT;
|
|
|
|
last_index =
|
|
|
|
csr.len == 0 ? ULONG_MAX : (csr.off + csr.len - 1) >> PAGE_SHIFT;
|
|
|
|
memset(&cs, 0, sizeof(struct cachestat));
|
2024-05-31 14:12:01 -04:00
|
|
|
mapping = fd_file(f)->f_mapping;
|
2023-05-02 18:36:07 -07:00
|
|
|
filemap_cachestat(mapping, first_index, last_index, &cs);
|
|
|
|
fdput(f);
|
|
|
|
|
|
|
|
if (copy_to_user(cstat, &cs, sizeof(struct cachestat)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_CACHESTAT_SYSCALL */
|