License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2005-04-16 15:20:36 -07:00
|
|
|
#ifndef _LINUX_PAGEMAP_H
#define _LINUX_PAGEMAP_H

/*
 * Copyright 1995 Linus Torvalds
 */
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/list.h>
#include <linux/highmem.h>
#include <linux/compiler.h>
|
2016-12-24 11:46:01 -08:00
|
|
|
#include <linux/uaccess.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
#include <linux/gfp.h>
|
2007-05-08 00:23:25 -07:00
|
|
|
#include <linux/bitops.h>
|
2008-07-25 19:45:30 -07:00
|
|
|
#include <linux/hardirq.h> /* for in_interrupt() */
|
2010-05-28 09:29:15 +09:00
|
|
|
#include <linux/hugetlb_inline.h>
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2017-11-15 17:37:33 -08:00
|
|
|
struct pagevec;
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
2016-10-11 13:56:04 -07:00
|
|
|
* Bits in mapping->flags.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2009-04-02 16:56:45 -07:00
|
|
|
enum mapping_flags {
|
2016-10-11 13:56:04 -07:00
|
|
|
AS_EIO = 0, /* IO error on async write */
|
|
|
|
AS_ENOSPC = 1, /* ENOSPC on async write */
|
|
|
|
AS_MM_ALL_LOCKS = 2, /* under mm_take_all_locks() */
|
|
|
|
AS_UNEVICTABLE = 3, /* e.g., ramdisk, SHM_LOCK */
|
|
|
|
AS_EXITING = 4, /* final truncate in progress */
|
mm: don't use radix tree writeback tags for pages in swap cache
File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
etc.) to accelerate finding the pages with a specific tag in the radix
tree during inode writeback. But for anonymous pages in the swap cache,
there is no inode writeback. So there is no need to find the pages with
some writeback tags in the radix tree. It is not necessary to touch
radix tree writeback tags for pages in the swap cache.
Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
introduced for address spaces which don't need to update the writeback
tags. The flag is set for swap caches. It may be used for DAX file
systems, etc.
With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
The test is done on a Xeon E5 v3 system. The swap device used is a RAM
simulated PMEM (persistent memory) device. The improvement comes from
the reduced contention on the swap cache radix tree lock. To test
sequential swapping out, the test case uses 8 processes, which
sequentially allocate and write to the anonymous pages until RAM and
part of the swap device is used up.
Details of the comparison are as follows:
base base+patch
---------------- --------------------------
%stddev %change %stddev
\ | \
2506952 ± 2% +28.1% 3212076 ± 7% vm-scalability.throughput
1207402 ± 7% +22.3% 1476578 ± 6% vmstat.swap.so
10.86 ± 12% -23.4% 8.31 ± 16% perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
10.82 ± 13% -33.1% 7.24 ± 14% perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
10.36 ± 11% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
10.52 ± 12% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 16:59:30 -07:00
|
|
|
/* writeback related tags are not used */
|
2016-10-11 13:56:04 -07:00
|
|
|
AS_NO_WRITEBACK_TAGS = 5,
|
2020-10-15 20:06:00 -07:00
|
|
|
AS_THP_SUPPORT = 6, /* THPs supported */
|
2009-04-02 16:56:45 -07:00
|
|
|
};
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2017-07-06 07:02:26 -04:00
|
|
|
/**
|
|
|
|
* mapping_set_error - record a writeback error in the address_space
|
2020-04-01 21:07:55 -07:00
|
|
|
* @mapping: the mapping in which an error should be set
|
|
|
|
* @error: the error to set in the mapping
|
2017-07-06 07:02:26 -04:00
|
|
|
*
|
|
|
|
* When writeback fails in some way, we must record that error so that
|
|
|
|
* userspace can be informed when fsync and the like are called. We endeavor
|
|
|
|
* to report errors on any file that was open at the time of the error. Some
|
|
|
|
* internal callers also need to know when writeback errors have occurred.
|
|
|
|
*
|
|
|
|
* When a writeback error occurs, most filesystems will want to call
|
|
|
|
* mapping_set_error to record the error in the mapping so that it can be
|
|
|
|
* reported when the application calls fsync(2).
|
|
|
|
*/
|
2007-05-08 00:23:25 -07:00
|
|
|
static inline void mapping_set_error(struct address_space *mapping, int error)
|
|
|
|
{
|
2017-07-06 07:02:26 -04:00
|
|
|
if (likely(!error))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Record in wb_err for checkers using errseq_t based tracking */
|
vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.
Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.
The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.
This patch (of 2):
Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.
Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.
It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.
This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.
To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.
An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.
I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-01 21:45:36 -07:00
|
|
|
__filemap_set_wb_err(mapping, error);
|
|
|
|
|
|
|
|
/* Record it in superblock */
|
2020-10-10 23:16:37 -07:00
|
|
|
if (mapping->host)
|
|
|
|
errseq_set(&mapping->host->i_sb->s_wb_err, error);
|
2017-07-06 07:02:26 -04:00
|
|
|
|
|
|
|
/* Record it in flags for now, for legacy callers */
|
|
|
|
if (error == -ENOSPC)
|
|
|
|
set_bit(AS_ENOSPC, &mapping->flags);
|
|
|
|
else
|
|
|
|
set_bit(AS_EIO, &mapping->flags);
|
2007-05-08 00:23:25 -07:00
|
|
|
}
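The syncfs commit message above suggests a userspace pattern: call syncfs() once, and only if it reports an error call fsync() on each descriptor to find the file that failed. A minimal sketch of that pattern follows. It assumes a kernel that includes this change, and `find_failed_writeback()` is a hypothetical helper, not an existing kernel or libc function. For simplicity it treats any syncfs() failure the same way, whether or not it was caused by writeback.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Declared explicitly in case _GNU_SOURCE was not in effect when
 * <unistd.h> was first included; this matches the libc prototype.
 */
int syncfs(int fd);

/*
 * Hypothetical helper: flush the whole filesystem once, and fall back to
 * fsync() on each descriptor only if syncfs() reports an error.
 *
 * Returns 0 if syncfs() succeeded, -1 if no descriptors were given, or
 * otherwise the number of descriptors whose fsync() failed. Those
 * descriptors are stored in 'failed', which must have room for nfds ints.
 */
int find_failed_writeback(const int *fds, size_t nfds, int *failed)
{
	size_t i;
	int nfailed = 0;

	if (nfds == 0)
		return -1;
	assert(failed != NULL);

	/* Any descriptor on the filesystem identifies it for syncfs(). */
	if (syncfs(fds[0]) == 0)
		return 0;

	/* Something went wrong: ask each file individually. */
	for (i = 0; i < nfds; i++) {
		if (fsync(fds[i]) != 0)
			failed[nfailed++] = fds[i];
	}
	return nfailed;
}
```

The sketch assumes all descriptors live on the same filesystem; a caller spanning several filesystems would need one syncfs() call per filesystem.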
|
|
|
|
|
2008-10-18 20:26:42 -07:00
|
|
|
static inline void mapping_set_unevictable(struct address_space *mapping)
{
	set_bit(AS_UNEVICTABLE, &mapping->flags);
}
|
|
|
|
|
2008-10-18 20:26:43 -07:00
|
|
|
static inline void mapping_clear_unevictable(struct address_space *mapping)
{
	clear_bit(AS_UNEVICTABLE, &mapping->flags);
}
|
|
|
|
|
mm: swap: make page_evictable() inline
When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping
pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with
a couple of vm-scalability's test cases (lru-file-readonce,
lru-file-readtwice and lru-file-mmap-read). I didn't see that much down
on my VM (32c-64g-2nodes). It might be caused by the test configuration,
which is 32c-256g with NUMA disabled and the tests were run in root memcg,
so the tests actually stress only one inactive and active lru. It sounds
not very usual in a modern production environment.
That commit did two major changes:
1. Call page_evictable()
2. Use smp_mb to force the PG_lru set visible
It looks they contribute the most overhead. The page_evictable() is a
function which does function prologue and epilogue, and that was used by
page reclaim path only. However, lru add is a very hot path, so it sounds
better to make it inline. However, it calls page_mapping() which is not
inlined either, but the disassembly shows it doesn't do push and pop
operations and it sounds not very straightforward to inline it.
Other than this, it sounds smp_mb() is not necessary for x86 since
SetPageLRU is atomic which enforces memory barrier already, replace it
with smp_mb__after_atomic() in the following patch.
With the two fixes applied, the tests can get back around 5% on that test
bench and get back normal on my VM. Since the test bench configuration is
not that usual and I also saw around 6% up on the latest upstream, so it
sounds good enough IMHO.
The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
mainline w/ inline fix
150MB 154MB
With this patch the throughput gets 2.67% up. The data with using
smp_mb__after_atomic() is shown in the following patch.
Shakeel Butt did the below test:
On a real machine with limiting the 'dd' on a single node and reading 100
GiB sparse file (less than a single node). Just ran a single instance to
not cause the lru lock contention. The cmdline used is "dd if=file-100GiB
of=/dev/null bs=4k". Ran the cmd 10 times with drop_caches in between and
measured the time it took.
Without patch: 56.64143 +- 0.672 sec
With patches: 56.10 +- 0.21 sec
[akpm@linux-foundation.org: move page_evictable() to internal.h]
Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-01 21:06:20 -07:00
|
|
|
static inline bool mapping_unevictable(struct address_space *mapping)
|
2008-10-18 20:26:42 -07:00
|
|
|
{
|
|
|
|
return mapping && test_bit(AS_UNEVICTABLE, &mapping->flags);
|
2008-10-18 20:26:42 -07:00
|
|
|
}
|
|
|
|
|
2014-04-03 14:47:49 -07:00
|
|
|
static inline void mapping_set_exiting(struct address_space *mapping)
{
	set_bit(AS_EXITING, &mapping->flags);
}

static inline int mapping_exiting(struct address_space *mapping)
{
	return test_bit(AS_EXITING, &mapping->flags);
}
|
|
|
|
|
|
|
|
static inline void mapping_set_no_writeback_tags(struct address_space *mapping)
{
	set_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
}

static inline int mapping_use_writeback_tags(struct address_space *mapping)
{
	return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
}
|
|
|
|
|
2005-10-07 07:46:04 +01:00
|
|
|
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2016-10-11 13:56:04 -07:00
|
|
|
return mapping->gfp_mask;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2015-11-06 16:28:49 -08:00
|
|
|
/* Restricts the given gfp_mask to what the mapping allows. */
static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
		gfp_t gfp_mask)
{
	return mapping_gfp_mask(mapping) & gfp_mask;
}
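The effect of mapping_gfp_constraint() is a plain bitwise AND: a flag survives only if both the mapping and the caller allow it. The userspace sketch below illustrates this with made-up flag bits. `fake_gfp_t` and the `FAKE_GFP_*` values are assumptions for illustration and do not correspond to the kernel's real GFP flags.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration only; not real GFP values. */
typedef unsigned int fake_gfp_t;
#define FAKE_GFP_IO		0x1u
#define FAKE_GFP_FS		0x2u
#define FAKE_GFP_RECLAIM	0x4u

/* Userspace model of the masking done by mapping_gfp_constraint(). */
fake_gfp_t fake_gfp_constraint(fake_gfp_t mapping_mask, fake_gfp_t requested)
{
	fake_gfp_t allowed = mapping_mask & requested;

	/* The result never contains a bit the caller did not ask for. */
	assert((allowed & ~requested) == 0);
	return allowed;
}
```

For example, a mapping whose mask lacks `FAKE_GFP_FS` strips that bit from any request, so an allocation made on its behalf could not recurse into the filesystem in this model.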
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
* This is non-atomic. Only to be used before the mapping is activated.
|
|
|
|
* Probably needs a barrier...
|
|
|
|
*/
|
2005-10-21 03:22:44 -04:00
|
|
|
static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
|
2005-04-16 15:20:36 -07:00
|
|
|
{
|
2016-10-11 13:56:04 -07:00
|
|
|
m->gfp_mask = mask;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2020-10-15 20:06:00 -07:00
|
|
|
static inline bool mapping_thp_support(struct address_space *mapping)
{
	return test_bit(AS_THP_SUPPORT, &mapping->flags);
}
|
|
|
|
|
2020-10-15 20:06:03 -07:00
|
|
|
static inline int filemap_nr_thps(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	return atomic_read(&mapping->nr_thps);
#else
	return 0;
#endif
}

static inline void filemap_nr_thps_inc(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	if (!mapping_thp_support(mapping))
		atomic_inc(&mapping->nr_thps);
#else
	WARN_ON_ONCE(1);
#endif
}

static inline void filemap_nr_thps_dec(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	if (!mapping_thp_support(mapping))
		atomic_dec(&mapping->nr_thps);
#else
	WARN_ON_ONCE(1);
#endif
}
|
|
|
|
|
2017-11-15 17:37:55 -08:00
|
|
|
void release_pages(struct page **pages, int nr);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2008-07-25 19:45:30 -07:00
|
|
|
/*
|
|
|
|
* speculatively take a reference to a page.
|
2016-05-19 17:10:49 -07:00
|
|
|
* If the page is free (_refcount == 0), then _refcount is untouched, and 0
|
|
|
|
* is returned. Otherwise, _refcount is incremented by 1 and 1 is returned.
|
2008-07-25 19:45:30 -07:00
|
|
|
*
|
|
|
|
* This function must be called inside the same rcu_read_lock() section as has
|
|
|
|
* been used to lookup the page in the pagecache radix-tree (or page table):
|
2016-05-19 17:10:49 -07:00
|
|
|
* this allows allocators to use a synchronize_rcu() to stabilize _refcount.
|
2008-07-25 19:45:30 -07:00
|
|
|
 *
 * Unless an RCU grace period has passed, the count of all pages coming out
 * of the allocator must be considered unstable. page_count may return higher
 * than expected, and put_page must be able to do the right thing when the
 * page has been finished with, no matter what it is subsequently allocated
 * for (because put_page is what is used here to drop an invalid speculative
 * reference).
 *
 * This is the interesting part of the lockless pagecache (and lockless
 * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
 * has the following pattern:
 * 1. find page in radix tree
 * 2. conditionally increment refcount
 * 3. check the page is still in pagecache (if no, goto 1)
 *
|
2016-05-19 17:10:49 -07:00
|
|
|
* Remove-side that cares about stability of _refcount (eg. reclaim) has the
|
2018-04-10 16:36:56 -07:00
|
|
|
* following (with the i_pages lock held):
|
2008-07-25 19:45:30 -07:00
|
|
|
 * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
 * B. remove page from pagecache
 * C. free the page
 *
 * There are 2 critical interleavings that matter:
 * - 2 runs before A: in this case, A sees elevated refcount and bails out
 * - A runs before 2: in this case, 2 sees zero refcount and retries;
 *   subsequently, B will complete and 1 will find no page, causing the
 *   lookup to return NULL.
 *
 * It is possible that between 1 and 2, the page is removed then the exact same
 * page is inserted into the same position in pagecache. That's OK: the
|
2018-04-10 16:36:56 -07:00
|
|
|
* old find_get_page using a lock could equally have run before or after
|
2008-07-25 19:45:30 -07:00
|
|
|
 * such a re-insertion, depending on order that locks are granted.
 *
 * Lookups racing against pagecache insertion isn't a big problem: either 1
 * will find the page or it will not. Likewise, the old find_get_page could run
 * either before the insertion or afterwards, depending on timing.
 */
|
2019-03-05 15:48:49 -08:00
|
|
|
static inline int __page_cache_add_speculative(struct page *page, int count)
|
2008-07-25 19:45:30 -07:00
|
|
|
{
|
2013-04-29 15:06:13 -07:00
|
|
|
#ifdef CONFIG_TINY_RCU
|
2011-06-08 01:13:27 +02:00
|
|
|
# ifdef CONFIG_PREEMPT_COUNT
|
2017-03-24 14:13:05 +03:00
|
|
|
VM_BUG_ON(!in_atomic() && !irqs_disabled());
|
2008-07-25 19:45:30 -07:00
|
|
|
# endif
|
|
|
|
	/*
	 * Preempt must be disabled here - we rely on rcu_read_lock doing
	 * this for us.
	 *
	 * Pagecache won't be truncated from interrupt context, so if we have
	 * found a page in the radix tree here, we have pinned its refcount by
	 * disabling preempt, and hence no need for the "speculative get" that
	 * SMP requires.
	 */
|
2014-01-23 15:52:54 -08:00
|
|
|
VM_BUG_ON_PAGE(page_count(page) == 0, page);
|
2019-03-05 15:48:49 -08:00
|
|
|
page_ref_add(page, count);
|
2008-07-25 19:45:30 -07:00
|
|
|
|
|
|
|
#else
|
2019-03-05 15:48:49 -08:00
|
|
|
if (unlikely(!page_ref_add_unless(page, count, 0))) {
|
2008-07-25 19:45:30 -07:00
|
|
|
/*
|
|
|
|
* Either the page has been freed, or will be freed.
|
|
|
|
* In either case, retry here and the caller should
|
|
|
|
* do the right thing (see comments above).
|
|
|
|
*/
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
2014-01-23 15:52:54 -08:00
|
|
|
VM_BUG_ON_PAGE(PageTail(page), page);
|
2008-07-25 19:45:30 -07:00
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
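The lookup-side step 2 and remove-side step A described in the comment before __page_cache_add_speculative() can be modelled in userspace with C11 atomics. This is only a sketch of the refcount handshake: `struct fake_page` and the `fake_*` functions are hypothetical stand-ins, and the RCU and locking requirements of the real code are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a page's _refcount; not the kernel's struct page. */
struct fake_page {
	atomic_int refcount;
};

/*
 * Add 'count' to the refcount unless it currently equals 'unless'.
 * Returns true if the add happened.
 */
bool fake_ref_add_unless(struct fake_page *p, int count, int unless)
{
	int old = atomic_load(&p->refcount);

	assert(count > 0);
	while (old != unless) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + count))
			return true;
	}
	return false;
}

/* Lookup side, step 2: 1 means a reference is held, 0 means retry. */
int fake_get_speculative(struct fake_page *p)
{
	return fake_ref_add_unless(p, 1, 0) ? 1 : 0;
}

/* Remove side, step A: set the count to 0 only if it is exactly 'expected'. */
bool fake_ref_freeze(struct fake_page *p, int expected)
{
	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}
```

Both interleavings from the comment fall out of the compare-and-exchange. If the lookup increments first, the freeze sees an unexpected count and fails. If the freeze runs first, the lookup sees zero and returns 0 so the caller can retry.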

static inline int page_cache_get_speculative(struct page *page)
{
	return __page_cache_add_speculative(page, 1);
}

static inline int page_cache_add_speculative(struct page *page, int count)
{
	return __page_cache_add_speculative(page, count);
}

/**
 * attach_page_private - Attach private data to a page.
 * @page: Page to attach data to.
 * @data: Data to attach to page.
 *
 * Attaching private data to a page increments the page's reference count.
 * The data must be detached before the page will be freed.
 */
static inline void attach_page_private(struct page *page, void *data)
{
	get_page(page);
	set_page_private(page, (unsigned long)data);
	SetPagePrivate(page);
}

/**
 * detach_page_private - Detach private data from a page.
 * @page: Page to detach data from.
 *
 * Removes the data that was previously attached to the page and decrements
 * the refcount on the page.
 *
 * Return: Data that was attached to the page.
 */
static inline void *detach_page_private(struct page *page)
{
	void *data = (void *)page_private(page);

	if (!PagePrivate(page))
		return NULL;
	ClearPagePrivate(page);
	set_page_private(page, 0);
	put_page(page);

	return data;
}
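
/*
 * Illustrative sketch only, not kernel code: a userspace model of the
 * attach/detach pairing above, showing that each attach takes one
 * reference and each successful detach drops it. struct priv_page,
 * priv_attach() and priv_detach() are hypothetical names; a plain int
 * and a bool stand in for the page refcount and the PagePrivate flag.
 */

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct page used here. */
struct priv_page {
	int refcount;
	bool has_private;
	unsigned long private_data;
};

/* Model of attach_page_private(): pin the page, store the pointer. */
static void priv_attach(struct priv_page *page, void *data)
{
	page->refcount++;
	page->private_data = (unsigned long)data;
	page->has_private = true;
}

/* Model of detach_page_private(): returns NULL if nothing was attached. */
static void *priv_detach(struct priv_page *page)
{
	void *data = (void *)page->private_data;

	if (!page->has_private)
		return NULL;
	page->has_private = false;
	page->private_data = 0;
	page->refcount--;

	return data;
}
```

/*
 * Detaching twice in a row is harmless in this model: the second call
 * sees the cleared flag and returns NULL without touching the refcount.
 */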

#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
	return alloc_pages(gfp, 0);
}
#endif

static inline struct page *page_cache_alloc(struct address_space *x)
{
	return __page_cache_alloc(mapping_gfp_mask(x));
}

static inline gfp_t readahead_gfp_mask(struct address_space *x)
{
	return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
}

typedef int filler_t(void *, struct page *);

pgoff_t page_cache_next_miss(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan);
pgoff_t page_cache_prev_miss(struct address_space *mapping,
			     pgoff_t index, unsigned long max_scan);

#define FGP_ACCESSED		0x00000001
#define FGP_LOCK		0x00000002
#define FGP_CREAT		0x00000004
#define FGP_WRITE		0x00000008
#define FGP_NOFS		0x00000010
#define FGP_NOWAIT		0x00000020
#define FGP_FOR_MMAP		0x00000040
#define FGP_HEAD		0x00000080

struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
		int fgp_flags, gfp_t cache_gfp_mask);

/**
 * find_get_page - find and get a page reference
 * @mapping: the address_space to search
 * @offset: the page index
 *
 * Looks up the page cache slot at @mapping & @offset.  If there is a
 * page cache page, it is returned with an increased refcount.
 *
 * Otherwise, %NULL is returned.
 */
static inline struct page *find_get_page(struct address_space *mapping,
					pgoff_t offset)
{
	return pagecache_get_page(mapping, offset, 0, 0);
}

static inline struct page *find_get_page_flags(struct address_space *mapping,
					pgoff_t offset, int fgp_flags)
{
	return pagecache_get_page(mapping, offset, fgp_flags, 0);
}

/**
 * find_lock_page - locate, pin and lock a pagecache page
 * @mapping: the address_space to search
 * @index: the page index
 *
 * Looks up the page cache entry at @mapping & @index.  If there is a
 * page cache page, it is returned locked and with an increased
 * refcount.
 *
 * Context: May sleep.
 * Return: A struct page or %NULL if there is no page in the cache for this
 * index.
 */
static inline struct page *find_lock_page(struct address_space *mapping,
					pgoff_t index)
{
	return pagecache_get_page(mapping, index, FGP_LOCK, 0);
}

/**
 * find_lock_head - Locate, pin and lock a pagecache page.
 * @mapping: The address_space to search.
 * @index: The page index.
 *
 * Looks up the page cache entry at @mapping & @index.  If there is a
 * page cache page, its head page is returned locked and with an increased
 * refcount.
 *
 * Context: May sleep.
 * Return: A struct page which is !PageTail, or %NULL if there is no page
 * in the cache for this index.
 */
static inline struct page *find_lock_head(struct address_space *mapping,
					pgoff_t index)
{
	return pagecache_get_page(mapping, index, FGP_LOCK | FGP_HEAD, 0);
}

/**
 * find_or_create_page - locate or add a pagecache page
 * @mapping: the page's address_space
 * @index: the page's index into the mapping
 * @gfp_mask: page allocation mode
 *
 * Looks up the page cache slot at @mapping & @index.  If there is a
 * page cache page, it is returned locked and with an increased
 * refcount.
 *
 * If the page is not present, a new page is allocated using @gfp_mask
 * and added to the page cache and the VM's LRU list.  The page is
 * returned locked and with an increased refcount.
 *
 * On memory exhaustion, %NULL is returned.
 *
 * find_or_create_page() may sleep, even if @gfp_mask specifies an
 * atomic allocation!
 */
static inline struct page *find_or_create_page(struct address_space *mapping,
					pgoff_t index, gfp_t gfp_mask)
{
	return pagecache_get_page(mapping, index,
					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
					gfp_mask);
}

/**
 * grab_cache_page_nowait - returns locked page at given index in given cache
 * @mapping: target address_space
 * @index: the page index
 *
 * Same as grab_cache_page(), but do not wait if the page is unavailable.
 * This is intended for speculative data generators, where the data can
 * be regenerated if the page couldn't be grabbed.  This routine should
 * be safe to call while holding the lock for another page.
 *
 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page.
 */
static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
				pgoff_t index)
{
	return pagecache_get_page(mapping, index,
			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
			mapping_gfp_mask(mapping));
}
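
/*
 * Illustrative sketch only, not kernel code: the wrappers above differ
 * mainly in the FGP_* bits they pass to pagecache_get_page(). The model
 * below copies the flag values from this header into a hypothetical
 * enum (SK_FGP_*) and names the combination each wrapper uses. The
 * helper sk_may_allocate() reflects the documented behaviour that
 * FGP_CREAT callers get a new page when none is cached; treat that
 * reading as an assumption of the sketch.
 */

```c
#include <stdbool.h>

/* Hypothetical mirror of the FGP_* values defined in this header. */
enum sk_fgp {
	SK_FGP_ACCESSED	= 0x01,
	SK_FGP_LOCK	= 0x02,
	SK_FGP_CREAT	= 0x04,
	SK_FGP_WRITE	= 0x08,
	SK_FGP_NOFS	= 0x10,
	SK_FGP_NOWAIT	= 0x20,
};

/* Flags passed by find_lock_page(). */
static int sk_flags_find_lock_page(void)
{
	return SK_FGP_LOCK;
}

/* Flags passed by find_or_create_page(). */
static int sk_flags_find_or_create_page(void)
{
	return SK_FGP_LOCK | SK_FGP_ACCESSED | SK_FGP_CREAT;
}

/* Flags passed by grab_cache_page_nowait(). */
static int sk_flags_grab_cache_page_nowait(void)
{
	return SK_FGP_LOCK | SK_FGP_CREAT | SK_FGP_NOFS | SK_FGP_NOWAIT;
}

/* True if the flag set asks for a page to be created on a miss. */
static bool sk_may_allocate(int fgp_flags)
{
	return (fgp_flags & SK_FGP_CREAT) != 0;
}
```

/*
 * Reading the wrappers this way makes it easy to see that only the
 * "create" variants need a meaningful gfp mask; the pure lookups pass 0.
 */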

/* Does this page contain this index? */
static inline bool thp_contains(struct page *head, pgoff_t index)
{
	/* HugeTLBfs indexes the page cache in units of hpage_size */
	if (PageHuge(head))
		return head->index == index;
	return page_index(head) == (index & ~(thp_nr_pages(head) - 1UL));
}

/*
 * Given the page we found in the page cache, return the page corresponding
 * to this index in the file
 */
static inline struct page *find_subpage(struct page *head, pgoff_t index)
{
	/* HugeTLBfs wants the head page regardless */
	if (PageHuge(head))
		return head;

	return head + (index & (thp_nr_pages(head) - 1));
}
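
/*
 * Illustrative sketch only, not kernel code: the mask arithmetic behind
 * thp_contains() and find_subpage() for the non-HugeTLBfs case. The
 * function names are hypothetical, and the sketch assumes nr_pages is a
 * power of two, which is what makes the "nr_pages - 1" mask valid.
 */

```c
#include <stdbool.h>

/* Index of the first page of the compound page covering @index. */
static unsigned long sk_head_index(unsigned long index, unsigned long nr_pages)
{
	return index & ~(nr_pages - 1UL);
}

/* Position of @index inside its compound page, counted in pages. */
static unsigned long sk_subpage_offset(unsigned long index,
				       unsigned long nr_pages)
{
	return index & (nr_pages - 1UL);
}

/* Does a compound page starting at @head_index cover @index? */
static bool sk_contains(unsigned long head_index, unsigned long nr_pages,
			unsigned long index)
{
	return head_index == sk_head_index(index, nr_pages);
}
```

/*
 * With 512 pages per compound page, file index 1027 rounds down to head
 * index 1024 and sits at offset 3, so the subpage is head + 3.
 */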

unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
			  unsigned int nr_entries, struct page **entries,
			  pgoff_t *indices);
unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
			pgoff_t end, unsigned int nr_pages,
			struct page **pages);
static inline unsigned find_get_pages(struct address_space *mapping,
			pgoff_t *start, unsigned int nr_pages,
			struct page **pages)
{
	return find_get_pages_range(mapping, start, (pgoff_t)-1, nr_pages,
				    pages);
}
unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
			       unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
			pgoff_t end, xa_mark_t tag, unsigned int nr_pages,
			struct page **pages);
static inline unsigned find_get_pages_tag(struct address_space *mapping,
			pgoff_t *index, xa_mark_t tag, unsigned int nr_pages,
			struct page **pages)
{
	return find_get_pages_range_tag(mapping, index, (pgoff_t)-1, tag,
					nr_pages, pages);
}

struct page *grab_cache_page_write_begin(struct address_space *mapping,
			pgoff_t index, unsigned flags);

/*
 * Returns locked page at given index in given cache, creating it if needed.
 */
static inline struct page *grab_cache_page(struct address_space *mapping,
								pgoff_t index)
{
	return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
}

extern struct page * read_cache_page(struct address_space *mapping,
				pgoff_t index, filler_t *filler, void *data);
extern struct page * read_cache_page_gfp(struct address_space *mapping,
				pgoff_t index, gfp_t gfp_mask);
extern int read_cache_pages(struct address_space *mapping,
		struct list_head *pages, filler_t *filler, void *data);

static inline struct page *read_mapping_page(struct address_space *mapping,
				pgoff_t index, void *data)
{
	return read_cache_page(mapping, index, NULL, data);
}

/*
 * Get the index of the page within the radix tree
 * (TODO: remove once hugetlb pages have ->index in units of PAGE_SIZE)
 */
static inline pgoff_t page_to_index(struct page *page)
{
	pgoff_t pgoff;

	if (likely(!PageTransTail(page)))
		return page->index;

	/*
	 *  We don't initialize ->index for tail pages: calculate based on
	 *  head page
	 */
	pgoff = compound_head(page)->index;
	pgoff += page - compound_head(page);
	return pgoff;
}

/*
 * Get the offset in PAGE_SIZE.
 * (TODO: hugepage should have ->index in PAGE_SIZE)
 */
static inline pgoff_t page_to_pgoff(struct page *page)
{
	if (unlikely(PageHeadHuge(page)))
		return page->index << compound_order(page);

	return page_to_index(page);
}

/*
 * Return byte-offset into filesystem object for page.
 */
static inline loff_t page_offset(struct page *page)
{
	return ((loff_t)page->index) << PAGE_SHIFT;
}

static inline loff_t page_file_offset(struct page *page)
{
	return ((loff_t)page_index(page)) << PAGE_SHIFT;
}
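
/*
 * Illustrative sketch only, not kernel code: the index-to-byte-offset
 * conversion done by page_offset() and page_file_offset(). The name
 * sk_page_offset() and the SK_OFF_PAGE_SHIFT value of 12 (4 KiB pages)
 * are assumptions of the sketch; the real PAGE_SHIFT depends on the
 * architecture and configuration.
 */

```c
#include <stdint.h>

#define SK_OFF_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Byte offset of the page with the given index within its file. */
static int64_t sk_page_offset(unsigned long index)
{
	/* Widen before shifting so large indices do not overflow. */
	return (int64_t)index << SK_OFF_PAGE_SHIFT;
}
```

/*
 * The cast comes before the shift for the same reason as the (loff_t)
 * cast above: on a 32-bit build the shift would otherwise be done in a
 * 32-bit type and lose the high bits.
 */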

extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
				     unsigned long address);

static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
					unsigned long address)
{
	pgoff_t pgoff;
	if (unlikely(is_vm_hugetlb_page(vma)))
		return linear_hugepage_index(vma, address);
	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
	pgoff += vma->vm_pgoff;
	return pgoff;
}
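
/*
 * Illustrative sketch only, not kernel code: the non-hugetlb path of
 * linear_page_index() as plain arithmetic. struct sk_vma keeps only the
 * two fields the calculation reads, and SK_VMA_PAGE_SHIFT is assumed to
 * be 12 (4 KiB pages); both names are hypothetical.
 */

```c
#define SK_VMA_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the relevant vm_area_struct fields. */
struct sk_vma {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* File page index backing virtual address @address inside @vma. */
static unsigned long sk_linear_page_index(const struct sk_vma *vma,
					  unsigned long address)
{
	unsigned long pgoff;

	/* Pages between the start of the mapping and the address... */
	pgoff = (address - vma->vm_start) >> SK_VMA_PAGE_SHIFT;
	/* ...plus the page at which the mapping starts in the file. */
	pgoff += vma->vm_pgoff;

	return pgoff;
}
```

/*
 * For a mapping that starts at file page 16, an address a little over
 * three pages past vm_start resolves to file page 19.
 */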

/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
struct wait_page_key {
	struct page *page;
	int bit_nr;
	int page_match;
};

struct wait_page_queue {
	struct page *page;
	int bit_nr;
	wait_queue_entry_t wait;
};

static inline bool wake_page_match(struct wait_page_queue *wait_page,
				  struct wait_page_key *key)
{
	if (wait_page->page != key->page)
		return false;
	key->page_match = 1;

	if (wait_page->bit_nr != key->bit_nr)
		return false;

	return true;
}
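
/*
 * Illustrative sketch only, not kernel code: a userspace copy of the
 * matching logic in wake_page_match(), to show that page_match is set
 * whenever the page matches, even if the waiter is waiting on a
 * different bit. The sk_wait_* names are hypothetical and the wait
 * queue entry is left out.
 */

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct wait_page_key. */
struct sk_wait_key {
	const void *page;
	int bit_nr;
	int page_match;
};

/* Hypothetical stand-in for struct wait_page_queue, minus the queue entry. */
struct sk_wait_queue {
	const void *page;
	int bit_nr;
};

/* Should this waiter be woken for the event described by @key? */
static bool sk_wake_page_match(const struct sk_wait_queue *waiter,
			       struct sk_wait_key *key)
{
	if (waiter->page != key->page)
		return false;
	/* Record that some waiter exists for this page. */
	key->page_match = 1;

	if (waiter->bit_nr != key->bit_nr)
		return false;

	return true;
}
```

/*
 * The three outcomes are: different page (no match, page_match left
 * alone), same page but different bit (no wakeup, page_match set), and
 * same page and bit (wakeup).
 */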

extern void __lock_page(struct page *page);
extern int __lock_page_killable(struct page *page);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
				unsigned int flags);
extern void unlock_page(struct page *page);

/*
 * Return true if the page was successfully locked
 */
static inline int trylock_page(struct page *page)
{
	page = compound_head(page);
	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
}

/*
 * lock_page may only be called if we have the page's inode pinned.
 */
static inline void lock_page(struct page *page)
{
	might_sleep();
	if (!trylock_page(page))
		__lock_page(page);
}
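
/*
 * Illustrative sketch only, not kernel code: a userspace model of the
 * test-and-set fast path that trylock_page() relies on. C11 atomics
 * stand in for test_and_set_bit_lock(); the sk_lock_* names and the
 * single-bit flag layout are hypothetical. The slow path of lock_page()
 * (sleeping until the bit clears) and the waiter wakeup done by the real
 * unlock_page() are deliberately left out.
 */

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SK_LOCKED_BIT 0x1UL	/* hypothetical position of the lock bit */

/* Hypothetical stand-in for a page: only the flags word is modelled. */
struct sk_lock_page {
	atomic_ulong flags;
};

/* Try to take the lock; returns true only if this caller set the bit. */
static bool sk_lock_try(struct sk_lock_page *page)
{
	unsigned long old;

	/* Acquire ordering: later accesses stay after the lock is taken. */
	old = atomic_fetch_or_explicit(&page->flags, SK_LOCKED_BIT,
				       memory_order_acquire);

	return !(old & SK_LOCKED_BIT);
}

/* Release the lock by clearing the bit with release ordering. */
static void sk_lock_release(struct sk_lock_page *page)
{
	atomic_fetch_and_explicit(&page->flags, ~SK_LOCKED_BIT,
				  memory_order_release);
}
```

/*
 * A second sk_lock_try() on a held lock fails without blocking, which is
 * the point at which lock_page() falls back to __lock_page().
 */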

/*
 * lock_page_killable is like lock_page but can be interrupted by fatal
 * signals.  It returns 0 if it locked the page and -EINTR if it was
 * killed while waiting.
 */
static inline int lock_page_killable(struct page *page)
{
	might_sleep();
	if (!trylock_page(page))
		return __lock_page_killable(page);
	return 0;
}

/*
 * lock_page_async - Lock the page, unless this would block. If the page
 * is already locked, then queue a callback when the page becomes unlocked.
 * This callback can then retry the operation.
 *
 * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
 * was already locked and the callback defined in 'wait' was queued.
 */
static inline int lock_page_async(struct page *page,
				  struct wait_page_queue *wait)
{
	if (!trylock_page(page))
		return __lock_page_async(page, wait);
	return 0;
}

/*
 * lock_page_or_retry - Lock the page, unless this would block and the
 * caller indicated that it can handle a retry.
 *
 * Return value and mmap_lock implications depend on flags; see
 * __lock_page_or_retry().
 */
static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
				     unsigned int flags)
{
	might_sleep();
	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
}
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
2017-02-22 15:44:41 -08:00
|
|
|
* This is exported only for wait_on_page_locked/wait_on_page_writeback, etc.,
|
|
|
|
* and should not be used directly.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
2008-02-13 15:03:15 -08:00
|
|
|
extern void wait_on_page_bit(struct page *page, int bit_nr);
|
2011-05-24 17:11:29 -07:00
|
|
|
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
|
2014-09-24 11:28:32 +10:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
* Wait for a page to be unlocked.
|
|
|
|
*
|
|
|
|
* This must be called with the caller "holding" the page,
|
|
|
|
* i.e. with increased "page->count" so that the page won't
|
|
|
|
* go away during the wait.
|
|
|
|
*/
|
|
|
|
static inline void wait_on_page_locked(struct page *page)
|
|
|
|
{
|
|
|
|
if (PageLocked(page))
|
2016-01-15 16:51:24 -08:00
|
|
|
wait_on_page_bit(compound_head(page), PG_locked);
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2016-12-25 13:00:30 +10:00
|
|
|
static inline int wait_on_page_locked_killable(struct page *page)
|
|
|
|
{
|
|
|
|
if (!PageLocked(page))
|
|
|
|
return 0;
|
|
|
|
return wait_on_page_bit_killable(compound_head(page), PG_locked);
|
|
|
|
}
|
|
|
|
|
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 00:36:14 -08:00
|
|
|
extern void put_and_wait_on_page_locked(struct page *page);
|
|
|
|
|
2019-05-13 17:23:11 -07:00
|
|
|
void wait_on_page_writeback(struct page *page);
|
2005-04-16 15:20:36 -07:00
|
|
|
extern void end_page_writeback(struct page *page);
|
mm: only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait. Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function. This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.
Here's the result of using dbench to test latency on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, the maximum write latency drops considerably with this
patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 16:42:51 -08:00
|
|
|
void wait_for_stable_page(struct page *page);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2016-08-05 08:11:04 -06:00
|
|
|
void page_endio(struct page *page, bool is_write, int err);
|
2014-06-04 16:07:45 -07:00
|
|
|
|
2009-04-03 16:42:39 +01:00
|
|
|
/*
|
|
|
|
* Add an arbitrary waiter to a page's wait queue
|
|
|
|
*/
|
2017-06-20 12:06:13 +02:00
|
|
|
extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
|
2009-04-03 16:42:39 +01:00
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
/*
|
2016-09-17 18:02:44 -04:00
|
|
|
* Fault everything in given userspace address range in.
|
2005-04-16 15:20:36 -07:00
|
|
|
*/
|
|
|
|
static inline int fault_in_pages_writeable(char __user *uaddr, int size)
|
2012-03-25 19:47:41 +02:00
|
|
|
{
|
2012-04-14 18:03:10 +02:00
|
|
|
char __user *end = uaddr + size - 1;
|
2012-03-25 19:47:41 +02:00
|
|
|
|
|
|
|
if (unlikely(size == 0))
|
2016-09-20 20:07:42 +01:00
|
|
|
return 0;
|
2012-03-25 19:47:41 +02:00
|
|
|
|
2016-09-20 20:07:42 +01:00
|
|
|
if (unlikely(uaddr > end))
|
|
|
|
return -EFAULT;
|
2012-03-25 19:47:41 +02:00
|
|
|
/*
|
|
|
|
* Writing zeroes into userspace here is OK, because we know that if
|
|
|
|
* the zero gets there, we'll be overwriting it.
|
|
|
|
*/
|
2016-09-20 20:07:42 +01:00
|
|
|
do {
|
|
|
|
if (unlikely(__put_user(0, uaddr) != 0))
|
|
|
|
return -EFAULT;
|
2012-03-25 19:47:41 +02:00
|
|
|
uaddr += PAGE_SIZE;
|
2016-09-20 20:07:42 +01:00
|
|
|
} while (uaddr <= end);
|
2012-03-25 19:47:41 +02:00
|
|
|
|
|
|
|
/* Check whether the range spilled into the next page. */
|
|
|
|
if (((unsigned long)uaddr & PAGE_MASK) ==
|
|
|
|
((unsigned long)end & PAGE_MASK))
|
2016-09-20 20:07:42 +01:00
|
|
|
return __put_user(0, end);
|
2012-03-25 19:47:41 +02:00
|
|
|
|
2016-09-20 20:07:42 +01:00
|
|
|
return 0;
|
2012-03-25 19:47:41 +02:00
|
|
|
}
|
|
|
|
|
2016-09-17 18:02:44 -04:00
|
|
|
static inline int fault_in_pages_readable(const char __user *uaddr, int size)
|
2012-03-25 19:47:41 +02:00
|
|
|
{
|
|
|
|
volatile char c;
|
|
|
|
const char __user *end = uaddr + size - 1;
|
|
|
|
|
|
|
|
if (unlikely(size == 0))
|
2016-09-20 20:07:42 +01:00
|
|
|
return 0;
|
2012-03-25 19:47:41 +02:00
|
|
|
|
2016-09-20 20:07:42 +01:00
|
|
|
if (unlikely(uaddr > end))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
do {
|
|
|
|
if (unlikely(__get_user(c, uaddr) != 0))
|
|
|
|
return -EFAULT;
|
2012-03-25 19:47:41 +02:00
|
|
|
uaddr += PAGE_SIZE;
|
2016-09-20 20:07:42 +01:00
|
|
|
} while (uaddr <= end);
|
2012-03-25 19:47:41 +02:00
|
|
|
|
|
|
|
/* Check whether the range spilled into the next page. */
|
|
|
|
if (((unsigned long)uaddr & PAGE_MASK) ==
|
|
|
|
((unsigned long)end & PAGE_MASK)) {
|
2016-09-20 20:07:42 +01:00
|
|
|
return __get_user(c, end);
|
2012-03-25 19:47:41 +02:00
|
|
|
}
|
|
|
|
|
2016-09-26 09:57:33 +10:00
|
|
|
(void)c;
|
2016-09-20 20:07:42 +01:00
|
|
|
return 0;
|
2012-03-25 19:47:41 +02:00
|
|
|
}
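The two fault-in helpers above touch one byte per page stride and then, if the stride stepped over the start of the final page, touch the last byte of the range as well. The sketch below records which addresses would be touched rather than performing any user access. It assumes a 4096-byte page, and the names `demo_fault_in_touches`, `DEMO_PAGE_SIZE` and `DEMO_PAGE_MASK` are hypothetical; it returns -1 where the kernel helpers return -EFAULT for a wrapped range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL                    /* assumed page size */
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/*
 * Record the addresses the fault-in loop would touch for the range
 * [uaddr, uaddr + size). Returns the number of touches, or -1 if the
 * range wraps around the address space.
 */
static int demo_fault_in_touches(uintptr_t uaddr, size_t size,
				 uintptr_t *out, int out_len)
{
	uintptr_t end = uaddr + size - 1;
	int n = 0;

	if (size == 0)
		return 0;
	if (uaddr > end)
		return -1;

	/* One touch per page-sized stride, starting at uaddr itself. */
	do {
		if (n < out_len)
			out[n] = uaddr;
		n++;
		uaddr += DEMO_PAGE_SIZE;
	} while (uaddr <= end);

	/* The stride may have skipped the page that holds 'end'. */
	if ((uaddr & DEMO_PAGE_MASK) == (end & DEMO_PAGE_MASK)) {
		if (n < out_len)
			out[n] = end;
		n++;
	}
	return n;
}
```

For example, a 0x20-byte range starting at 0x1ff0 straddles a page boundary, so the sketch reports two touches: the first byte and the last byte.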
|
|
|
|
|
2008-08-02 12:01:03 +02:00
|
|
|
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
|
|
|
|
pgoff_t index, gfp_t gfp_mask);
|
|
|
|
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
|
|
|
|
pgoff_t index, gfp_t gfp_mask);
|
2011-03-22 16:30:53 -07:00
|
|
|
extern void delete_from_page_cache(struct page *page);
|
2016-03-15 14:57:22 -07:00
|
|
|
extern void __delete_from_page_cache(struct page *page, void *shadow);
|
2011-03-22 16:30:52 -07:00
|
|
|
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
|
2017-11-15 17:37:33 -08:00
|
|
|
void delete_from_page_cache_batch(struct address_space *mapping,
|
|
|
|
struct pagevec *pvec);
|
2008-08-02 12:01:03 +02:00
|
|
|
|
mm: move readahead prototypes from mm.h
Patch series "Change readahead API", v11.
This series adds a readahead address_space operation to replace the
readpages operation. The key difference is that pages are added to the
page cache as they are allocated (and then looked up by the filesystem)
instead of passing them on a list to the readpages operation and having
the filesystem add them to the page cache. It's a net reduction in code
for each implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by yu kuai at
http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com
The only unconverted filesystems are those which use fscache. Their
conversion is pending Dave Howells' rewrite which will make the
conversion substantially easier. This should be completed by the end of
the year.
I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
Miklos Szeredi have done a marvellous job of providing constructive
criticism.
These patches pass an xfstests run on ext4, xfs & btrfs with no
regressions that I can tell (some of the tests seem a little flaky
before and remain flaky afterwards).
This patch (of 25):
The readahead code is part of the page cache so should be found in the
pagemap.h file. force_page_cache_readahead is only used within mm, so
move it to mm/internal.h instead. Remove the parameter names where they
add no value, and rename the ones which were actively misleading.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-01 21:46:07 -07:00
|
|
|
#define VM_READAHEAD_PAGES (SZ_128K / PAGE_SIZE)
|
|
|
|
|
|
|
|
void page_cache_sync_readahead(struct address_space *, struct file_ra_state *,
|
|
|
|
struct file *, pgoff_t index, unsigned long req_count);
|
|
|
|
void page_cache_async_readahead(struct address_space *, struct file_ra_state *,
|
|
|
|
struct file *, struct page *, pgoff_t index,
|
|
|
|
unsigned long req_count);
|
2020-10-15 20:06:14 -07:00
|
|
|
void page_cache_ra_unbounded(struct readahead_control *,
|
|
|
|
unsigned long nr_to_read, unsigned long lookahead_count);
|
|
|
|
|
2008-08-02 12:01:03 +02:00
|
|
|
/*
|
|
|
|
* Like add_to_page_cache_locked, but used to add newly allocated pages:
|
2016-01-15 16:51:24 -08:00
|
|
|
* the page is new, so we can just run __SetPageLocked() against it.
|
2008-08-02 12:01:03 +02:00
|
|
|
*/
|
|
|
|
static inline int add_to_page_cache(struct page *page,
|
|
|
|
struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
2016-01-15 16:51:24 -08:00
|
|
|
__SetPageLocked(page);
|
2008-08-02 12:01:03 +02:00
|
|
|
error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
|
|
|
|
if (unlikely(error))
|
2016-01-15 16:51:24 -08:00
|
|
|
__ClearPageLocked(page);
|
2008-08-02 12:01:03 +02:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2020-06-01 21:46:21 -07:00
|
|
|
/**
|
|
|
|
* struct readahead_control - Describes a readahead request.
|
|
|
|
*
|
|
|
|
* A readahead request is for consecutive pages. Filesystems which
|
|
|
|
* implement the ->readahead method should call readahead_page() or
|
|
|
|
* readahead_page_batch() in a loop and attempt to start I/O against
|
|
|
|
* each page in the request.
|
|
|
|
*
|
|
|
|
* Most of the fields in this struct are private and should be accessed
|
|
|
|
* by the functions below.
|
|
|
|
*
|
|
|
|
* @file: The file, used primarily by network filesystems for authentication.
|
|
|
|
* May be NULL if invoked internally by the filesystem.
|
|
|
|
* @mapping: Readahead this filesystem object.
|
|
|
|
*/
|
|
|
|
struct readahead_control {
|
|
|
|
struct file *file;
|
|
|
|
struct address_space *mapping;
|
|
|
|
/* private: use the readahead_* accessors instead */
|
|
|
|
pgoff_t _index;
|
|
|
|
unsigned int _nr_pages;
|
|
|
|
unsigned int _batch_count;
|
|
|
|
};
|
|
|
|
|
2020-10-15 20:06:10 -07:00
|
|
|
#define DEFINE_READAHEAD(rac, f, m, i) \
|
|
|
|
struct readahead_control rac = { \
|
|
|
|
.file = f, \
|
|
|
|
.mapping = m, \
|
|
|
|
._index = i, \
|
|
|
|
}
|
|
|
|
|
2020-06-01 21:46:21 -07:00
|
|
|
/**
|
|
|
|
* readahead_page - Get the next page to read.
|
|
|
|
* @rac: The current readahead request.
|
|
|
|
*
|
|
|
|
* Context: The page is locked and has an elevated refcount. The caller
|
|
|
|
* should decrease the refcount once the page has been submitted for I/O
|
|
|
|
* and unlock the page once all I/O to that page has completed.
|
|
|
|
* Return: A pointer to the next page, or %NULL if we are done.
|
|
|
|
*/
|
|
|
|
static inline struct page *readahead_page(struct readahead_control *rac)
|
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
BUG_ON(rac->_batch_count > rac->_nr_pages);
|
|
|
|
rac->_nr_pages -= rac->_batch_count;
|
|
|
|
rac->_index += rac->_batch_count;
|
|
|
|
|
|
|
|
if (!rac->_nr_pages) {
|
|
|
|
rac->_batch_count = 0;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
page = xa_load(&rac->mapping->i_pages, rac->_index);
|
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
2020-08-14 17:30:37 -07:00
|
|
|
rac->_batch_count = thp_nr_pages(page);
|
2020-06-01 21:46:21 -07:00
|
|
|
|
|
|
|
return page;
|
|
|
|
}
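The bookkeeping in readahead_page() is easy to misread: each call first consumes the batch handed out by the previous call, and only then returns the next page. The sketch below models just that accounting in user space. `struct demo_rac` and `demo_readahead_next()` are hypothetical names, and the `pages_in` parameter is an assumed stand-in for the value thp_nr_pages() would report for the page found at the new index.

```c
#include <assert.h>

/* Hypothetical mirror of the private fields of struct readahead_control. */
struct demo_rac {
	unsigned long index;       /* like _index */
	unsigned int nr_pages;     /* like _nr_pages */
	unsigned int batch_count;  /* like _batch_count */
};

/*
 * Advance past the previous batch, then hand out the next page index.
 * Returns 1 and stores the index in *out, or 0 once the request is
 * exhausted (where readahead_page() would return NULL).
 */
static int demo_readahead_next(struct demo_rac *rac, unsigned int pages_in,
			       unsigned long *out)
{
	assert(rac->batch_count <= rac->nr_pages);
	rac->nr_pages -= rac->batch_count;
	rac->index += rac->batch_count;

	if (!rac->nr_pages) {
		rac->batch_count = 0;
		return 0;
	}

	*out = rac->index;
	rac->batch_count = pages_in;
	return 1;
}
```

With a three-page request starting at index 10 and single-page entries, successive calls yield indices 10, 11 and 12, and the fourth call reports completion with the index advanced to 13.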
|
|
|
|
|
|
|
|
static inline unsigned int __readahead_batch(struct readahead_control *rac,
|
|
|
|
struct page **array, unsigned int array_sz)
|
|
|
|
{
|
|
|
|
unsigned int i = 0;
|
|
|
|
XA_STATE(xas, &rac->mapping->i_pages, 0);
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
BUG_ON(rac->_batch_count > rac->_nr_pages);
|
|
|
|
rac->_nr_pages -= rac->_batch_count;
|
|
|
|
rac->_index += rac->_batch_count;
|
|
|
|
rac->_batch_count = 0;
|
|
|
|
|
|
|
|
xas_set(&xas, rac->_index);
|
|
|
|
rcu_read_lock();
|
|
|
|
xas_for_each(&xas, page, rac->_index + rac->_nr_pages - 1) {
|
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
|
|
|
VM_BUG_ON_PAGE(PageTail(page), page);
|
|
|
|
array[i++] = page;
|
2020-08-14 17:30:37 -07:00
|
|
|
rac->_batch_count += thp_nr_pages(page);
|
2020-06-01 21:46:21 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The page cache isn't using multi-index entries yet,
|
|
|
|
* so the xas cursor needs to be manually moved to the
|
|
|
|
* next index. This can be removed once the page cache
|
|
|
|
* is converted.
|
|
|
|
*/
|
|
|
|
if (PageHead(page))
|
|
|
|
xas_set(&xas, rac->_index + rac->_batch_count);
|
|
|
|
|
|
|
|
if (i == array_sz)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* readahead_page_batch - Get a batch of pages to read.
|
|
|
|
* @rac: The current readahead request.
|
|
|
|
* @array: An array of pointers to struct page.
|
|
|
|
*
|
|
|
|
* Context: The pages are locked and have an elevated refcount. The caller
|
|
|
|
* should decrease the refcount once the page has been submitted for I/O
|
|
|
|
* and unlock the page once all I/O to that page has completed.
|
|
|
|
* Return: The number of pages placed in the array. 0 indicates the request
|
|
|
|
* is complete.
|
|
|
|
*/
|
|
|
|
#define readahead_page_batch(rac, array) \
|
|
|
|
__readahead_batch(rac, array, ARRAY_SIZE(array))
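The batch variant follows the same accounting as readahead_page(), but fills a caller-supplied array and stops when either the array is full or the request runs out. Below is a simplified user-space model of that loop. It assumes every entry is a single base page that is present in the cache, and `struct demo_batch_rac` and `demo_readahead_batch()` are hypothetical names that store page indices instead of page pointers.

```c
#include <assert.h>

/* Hypothetical mirror of the private fields of struct readahead_control. */
struct demo_batch_rac {
	unsigned long index;       /* like _index */
	unsigned int nr_pages;     /* like _nr_pages */
	unsigned int batch_count;  /* like _batch_count */
};

/*
 * Consume the previous batch, then place up to array_sz page indices
 * into 'array'. Returns how many entries were stored; 0 means the
 * request is complete. array_sz must be at least 1.
 */
static unsigned int demo_readahead_batch(struct demo_batch_rac *rac,
					 unsigned long *array,
					 unsigned int array_sz)
{
	unsigned int i = 0;
	unsigned long idx;

	assert(rac->batch_count <= rac->nr_pages);
	rac->nr_pages -= rac->batch_count;
	rac->index += rac->batch_count;
	rac->batch_count = 0;

	for (idx = rac->index; idx < rac->index + rac->nr_pages; idx++) {
		array[i++] = idx;
		rac->batch_count += 1;  /* one base page per entry here */
		if (i == array_sz)
			break;
	}
	return i;
}
```

A five-page request drained through a two-entry array comes back as batches of 2, 2 and 1, followed by 0.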
|
|
|
|
|
|
|
|
/**
|
|
|
|
* readahead_pos - The byte offset into the file of this readahead request.
|
|
|
|
* @rac: The readahead request.
|
|
|
|
*/
|
|
|
|
static inline loff_t readahead_pos(struct readahead_control *rac)
|
|
|
|
{
|
|
|
|
return (loff_t)rac->_index * PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* readahead_length - The number of bytes in this readahead request.
|
|
|
|
* @rac: The readahead request.
|
|
|
|
*/
|
|
|
|
static inline loff_t readahead_length(struct readahead_control *rac)
|
|
|
|
{
|
|
|
|
return (loff_t)rac->_nr_pages * PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* readahead_index - The index of the first page in this readahead request.
|
|
|
|
* @rac: The readahead request.
|
|
|
|
*/
|
|
|
|
static inline pgoff_t readahead_index(struct readahead_control *rac)
|
|
|
|
{
|
|
|
|
return rac->_index;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* readahead_count - The number of pages in this readahead request.
|
|
|
|
* @rac: The readahead request.
|
|
|
|
*/
|
|
|
|
static inline unsigned int readahead_count(struct readahead_control *rac)
|
|
|
|
{
|
|
|
|
return rac->_nr_pages;
|
|
|
|
}
|
|
|
|
|
2015-05-24 17:19:41 +02:00
|
|
|
static inline unsigned long dir_pages(struct inode *inode)
|
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
|
|
|
return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
|
|
|
|
PAGE_SHIFT;
|
2015-05-24 17:19:41 +02:00
|
|
|
}
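dir_pages() is a round-up division written as an add and a shift. A small worked example, assuming a 4096-byte page (PAGE_SHIFT of 12); `demo_dir_pages`, `DEMO_PAGE_SHIFT` and `DEMO_PAGE_SIZE` are hypothetical names used only for this sketch.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12                       /* assumed page shift */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Number of pages needed to hold i_size bytes, rounding up. */
static unsigned long demo_dir_pages(long long i_size)
{
	return (unsigned long)(i_size + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}
```

An empty directory needs no pages, sizes from 1 to 4096 bytes need one page, and 4097 bytes spill into a second page.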
|
|
|
|
|
2020-01-06 08:58:23 -08:00
|
|
|
/**
|
|
|
|
* page_mkwrite_check_truncate - check if page was truncated
|
|
|
|
* @page: the page to check
|
|
|
|
* @inode: the inode to check the page against
|
|
|
|
*
|
|
|
|
* Returns the number of bytes in the page up to EOF,
|
|
|
|
* or -EFAULT if the page was truncated.
|
|
|
|
*/
|
|
|
|
static inline int page_mkwrite_check_truncate(struct page *page,
|
|
|
|
struct inode *inode)
|
|
|
|
{
|
|
|
|
loff_t size = i_size_read(inode);
|
|
|
|
pgoff_t index = size >> PAGE_SHIFT;
|
|
|
|
int offset = offset_in_page(size);
|
|
|
|
|
|
|
|
if (page->mapping != inode->i_mapping)
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
/* page is wholly inside EOF */
|
|
|
|
if (page->index < index)
|
|
|
|
return PAGE_SIZE;
|
|
|
|
/* page is wholly past EOF */
|
|
|
|
if (page->index > index || !offset)
|
|
|
|
return -EFAULT;
|
|
|
|
/* page is partially inside EOF */
|
|
|
|
return offset;
|
|
|
|
}
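The three outcomes of page_mkwrite_check_truncate() depend only on where the page index sits relative to the page that contains EOF. The sketch below reproduces that decision in user space under an assumed 4096-byte page. `demo_check_truncate` is a hypothetical name, and its `same_mapping` flag stands in for the `page->mapping != inode->i_mapping` test.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_PAGE_SHIFT 12                      /* assumed page shift */
#define DEMO_PAGE_SIZE  (1L << DEMO_PAGE_SHIFT)

/*
 * Returns the number of valid bytes in the page up to EOF, or -EFAULT
 * if the page lies past EOF or no longer belongs to the mapping.
 */
static int demo_check_truncate(unsigned long page_index, int same_mapping,
			       long long size)
{
	unsigned long index = (unsigned long)(size >> DEMO_PAGE_SHIFT);
	int offset = (int)(size & (DEMO_PAGE_SIZE - 1));

	if (!same_mapping)
		return -EFAULT;

	/* Page is wholly inside EOF. */
	if (page_index < index)
		return (int)DEMO_PAGE_SIZE;
	/* Page is wholly past EOF (including EOF exactly on a boundary). */
	if (page_index > index || !offset)
		return -EFAULT;
	/* Page is partially inside EOF. */
	return offset;
}
```

For a 10000-byte file, page 0 is fully valid, page 2 holds 1808 valid bytes, and page 3 is past EOF. When the size is exactly 8192, page 2 is also past EOF because the offset within it is zero.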
|
|
|
|
|
2020-09-21 08:58:39 -07:00
|
|
|
/**
|
|
|
|
* i_blocks_per_page - How many blocks fit in this page.
|
|
|
|
* @inode: The inode which contains the blocks.
|
|
|
|
* @page: The page (head page if the page is a THP).
|
|
|
|
*
|
|
|
|
* If the block size is larger than the size of this page, return zero.
|
|
|
|
*
|
|
|
|
* Context: The caller should hold a refcount on the page to prevent it
|
|
|
|
* from being split.
|
|
|
|
* Return: The number of filesystem blocks covered by this page.
|
|
|
|
*/
|
|
|
|
static inline
|
|
|
|
unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
|
|
|
|
{
|
|
|
|
return thp_size(page) >> inode->i_blkbits;
|
|
|
|
}
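The shift in i_blocks_per_page() divides the page size in bytes by the block size, which is why a block larger than the page yields zero. A brief worked example follows; `demo_blocks_per_page` is a hypothetical name, and its `page_bytes` argument is an assumed stand-in for the value thp_size() would return.

```c
#include <assert.h>

/* Blocks of size (1 << blkbits) that fit in a page of page_bytes bytes. */
static unsigned int demo_blocks_per_page(unsigned long page_bytes,
					 unsigned int blkbits)
{
	return (unsigned int)(page_bytes >> blkbits);
}
```

With 4096-byte pages, 512-byte blocks give 8 blocks per page and 64 KiB blocks give 0. A 2 MiB huge page with 4096-byte blocks gives 512.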
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif /* _LINUX_PAGEMAP_H */
|