// SPDX-License-Identifier: GPL-2.0-only
/*
 * mm/readahead.c - address_space-level file readahead.
 *
 * Copyright (C) 2002, Linus Torvalds
 *
 * 09Apr2002	Andrew Morton
 *		Initial version.
 */

/**
 * DOC: Readahead Overview
 *
 * Readahead is used to read content into the page cache before it is
 * explicitly requested by the application. Readahead only ever
 * attempts to read folios that are not yet in the page cache. If a
 * folio is present but not up-to-date, readahead will not try to read
 * it. In that case a simple ->read_folio() will be requested.
 *
 * Readahead is triggered when an application read request (whether a
 * system call or a page fault) finds that the requested folio is not in
 * the page cache, or that it is in the page cache and has the
 * readahead flag set. This flag indicates that the folio was read
 * as part of a previous readahead request and now that it has been
 * accessed, it is time for the next readahead.
 *
 * Each readahead request is partly a synchronous read, and partly an
 * async readahead. This is reflected in the struct file_ra_state which
 * contains ->size being the total number of pages, and ->async_size
 * which is the number of pages in the async section. The readahead
 * flag will be set on the first folio in this async section to trigger
 * a subsequent readahead. Once a series of sequential reads has been
 * established, there should be no need for a synchronous component and
 * all readahead requests will be fully asynchronous.
 *
 * When either of the triggers causes a readahead, three numbers need
 * to be determined: the start of the region to read, the size of the
 * region, and the size of the async tail.
 *
 * The start of the region is simply the first page address at or after
 * the accessed address, which is not currently populated in the page
 * cache. This is found with a simple search in the page cache.
 *
 * The size of the async tail is determined by subtracting the size that
 * was explicitly requested from the determined request size, unless
 * this would be less than zero - then zero is used. NOTE THIS
 * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
 * PAGE. ALSO THIS CALCULATION IS NOT USED CONSISTENTLY.
 *
 * The size of the region is normally determined from the size of the
 * previous readahead which loaded the preceding pages. This may be
 * discovered from the struct file_ra_state for simple sequential reads,
 * or from examining the state of the page cache when multiple
 * sequential reads are interleaved. Specifically: where the readahead
 * was triggered by the readahead flag, the size of the previous
 * readahead is assumed to be the number of pages from the triggering
 * page to the start of the new readahead. In these cases, the size of
 * the previous readahead is scaled, often doubled, for the new
 * readahead, though see get_next_ra_size() for details.
 *
 * If the size of the previous read cannot be determined, the number of
 * preceding pages in the page cache is used to estimate the size of
 * a previous read. This estimate could easily be misled by random
 * reads being coincidentally adjacent, so it is ignored unless it is
 * larger than the current request, and it is not scaled up, unless it
 * is at the start of the file.
 *
 * In general readahead is accelerated at the start of the file, as
 * reads from there are often sequential. There are other minor
 * adjustments to the readahead size in various special cases and these
 * are best discovered by reading the code.
 *
 * The above calculation, based on the previous readahead size,
 * determines the size of the readahead, to which any requested read
 * size may be added.
 *
 * Readahead requests are sent to the filesystem using the ->readahead()
 * address space operation, for which mpage_readahead() is a canonical
 * implementation. ->readahead() should normally initiate reads on all
 * folios, but may fail to read any or all folios without causing an I/O
 * error. The page cache reading code will issue a ->read_folio() request
 * for any folio which ->readahead() did not read, and only an error
 * from this will be final.
 *
 * ->readahead() will generally call readahead_folio() repeatedly to get
 * each folio from those prepared for readahead. It may fail to read a
 * folio by:
 *
 * * not calling readahead_folio() sufficiently many times, effectively
 *   ignoring some folios, as might be appropriate if the path to
 *   storage is congested.
 *
 * * failing to actually submit a read request for a given folio,
 *   possibly due to insufficient resources, or
 *
 * * getting an error during subsequent processing of a request.
 *
 * In the last two cases, the folio should be unlocked by the filesystem
 * to indicate that the read attempt has failed. In the first case the
 * folio will be unlocked by the VFS.
 *
 * Those folios not in the final ``async_size`` of the request should be
 * considered to be important and ->readahead() should not fail them due
 * to congestion or temporary resource unavailability, but should wait
 * for necessary resources (e.g. memory or indexing information) to
 * become available. Folios in the final ``async_size`` may be
 * considered less urgent and failure to read them is more acceptable.
 * In this case it is best to use filemap_remove_folio() to remove the
 * folios from the page cache as is automatically done for folios that
 * were not fetched with readahead_folio(). This will allow a
 * subsequent synchronous readahead request to try them again. If they
 * are left in the page cache, then they will be read individually using
 * ->read_folio() which may be less efficient.
 */

#include <linux/blkdev.h>
#include <linux/kernel.h>
#include <linux/dax.h>
#include <linux/gfp.h>
#include <linux/export.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/pagemap.h>
#include <linux/psi.h>
#include <linux/syscalls.h>
#include <linux/file.h>
#include <linux/mm_inline.h>
#include <linux/blk-cgroup.h>
#include <linux/fadvise.h>
#include <linux/sched/mm.h>

#include "internal.h"

/*
 * Initialise a struct file's readahead state. Assumes that the caller has
 * memset *ra to zero.
 */
void
file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
	ra->prev_pos = -1;
}
EXPORT_SYMBOL_GPL(file_ra_state_init);

static void read_pages(struct readahead_control *rac)
{
	const struct address_space_operations *aops = rac->mapping->a_ops;
	struct folio *folio;
	struct blk_plug plug;

	if (!readahead_count(rac))
		return;

	if (unlikely(rac->_workingset))
		psi_memstall_enter(&rac->_pflags);
	blk_start_plug(&plug);

	if (aops->readahead) {
		aops->readahead(rac);
		/*
		 * Clean up the remaining folios. The sizes in ->ra
		 * may be used to size the next readahead, so make sure
		 * they accurately reflect what happened.
		 */
		while ((folio = readahead_folio(rac)) != NULL) {
			unsigned long nr = folio_nr_pages(folio);

			folio_get(folio);
			rac->ra->size -= nr;
			if (rac->ra->async_size >= nr) {
				rac->ra->async_size -= nr;
				filemap_remove_folio(folio);
			}
			folio_unlock(folio);
			folio_put(folio);
		}
	} else {
		while ((folio = readahead_folio(rac)) != NULL)
			aops->read_folio(rac->file, folio);
	}

	blk_finish_plug(&plug);
	if (unlikely(rac->_workingset))
		psi_memstall_leave(&rac->_pflags);
	rac->_workingset = false;

	BUG_ON(readahead_count(rac));
}

/**
 * page_cache_ra_unbounded - Start unchecked readahead.
 * @ractl: Readahead control.
 * @nr_to_read: The number of pages to read.
 * @lookahead_size: Where to start the next readahead.
 *
 * This function is for filesystems to call when they want to start
 * readahead beyond a file's stated i_size. This is almost certainly
 * not the function you want to call. Use page_cache_async_readahead()
 * or page_cache_sync_readahead() instead.
 *
 * Context: File is referenced by caller. Mutexes may be held by caller.
 * May sleep, but will not reenter filesystem to reclaim memory.
 */
void page_cache_ra_unbounded(struct readahead_control *ractl,
		unsigned long nr_to_read, unsigned long lookahead_size)
{
	struct address_space *mapping = ractl->mapping;
	unsigned long ra_folio_index, index = readahead_index(ractl);
	gfp_t gfp_mask = readahead_gfp_mask(mapping);
	unsigned long mark, i = 0;
	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);

	/*
	 * Partway through the readahead operation, we will have added
	 * locked pages to the page cache, but will not yet have submitted
	 * them for I/O. Adding another page may need to allocate memory,
	 * which can trigger memory reclaim. Telling the VM we're in
	 * the middle of a filesystem operation will cause it to not
	 * touch file-backed pages, preventing a deadlock. Most (all?)
	 * filesystems already specify __GFP_NOFS in their mapping's
	 * gfp_mask, but let's be explicit here.
	 */
	unsigned int nofs = memalloc_nofs_save();

	filemap_invalidate_lock_shared(mapping);
	index = mapping_align_index(mapping, index);

	/*
	 * As iterator `i` is aligned to min_nrpages, round_up the
	 * difference between nr_to_read and lookahead_size to mark the
	 * index that only has lookahead or "async_region" to set the
	 * readahead flag.
	 */
	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
				  min_nrpages);
	mark = ra_folio_index - index;
	nr_to_read += readahead_index(ractl) - index;
	ractl->_index = index;

	/*
	 * Preallocate as many pages as we will need.
	 */
	while (i < nr_to_read) {
		struct folio *folio = xa_load(&mapping->i_pages, index + i);
		int ret;

		if (folio && !xa_is_value(folio)) {
			/*
			 * Page already present? Kick off the current batch
			 * of contiguous pages before continuing with the
			 * next batch. This page may be the one we would
			 * have intended to mark as Readahead, but we don't
			 * have a stable reference to this page, and it's
			 * not worth getting one just for that.
			 */
			read_pages(ractl);
			ractl->_index += min_nrpages;
			i = ractl->_index + ractl->_nr_pages - index;
			continue;
		}

		folio = filemap_alloc_folio(gfp_mask,
					    mapping_min_folio_order(mapping));
		if (!folio)
			break;

		ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
		if (ret < 0) {
			folio_put(folio);
			if (ret == -ENOMEM)
				break;
			read_pages(ractl);
			ractl->_index += min_nrpages;
			i = ractl->_index + ractl->_nr_pages - index;
			continue;
		}
		if (i == mark)
			folio_set_readahead(folio);
		ractl->_workingset |= folio_test_workingset(folio);
		ractl->_nr_pages += min_nrpages;
		i += min_nrpages;
	}

	/*
	 * Now start the IO. We ignore I/O errors - if the folio is not
	 * uptodate then the caller will launch read_folio again, and
	 * will then handle the error.
	 */
	read_pages(ractl);
	filemap_invalidate_unlock_shared(mapping);
	memalloc_nofs_restore(nofs);
}
EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);

/*
 * do_page_cache_ra() actually reads a chunk of disk. It allocates
 * the pages first, then submits them for I/O. This avoids the very bad
 * behaviour which would occur if page allocations are causing VM writeback.
 * We really don't want to intermingle reads and writes like that.
 */
static void do_page_cache_ra(struct readahead_control *ractl,
		unsigned long nr_to_read, unsigned long lookahead_size)
{
	struct inode *inode = ractl->mapping->host;
	unsigned long index = readahead_index(ractl);
	loff_t isize = i_size_read(inode);
	pgoff_t end_index;	/* The last page we want to read */

	if (isize == 0)
		return;

	end_index = (isize - 1) >> PAGE_SHIFT;
	if (index > end_index)
		return;
	/* Don't read past the page containing the last byte of the file */
	if (nr_to_read > end_index - index)
		nr_to_read = end_index - index + 1;

	page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size);
}
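
/*
 * Illustration only, not part of the kernel build: a minimal user-space
 * sketch of the end-of-file clamp that do_page_cache_ra() applies before
 * calling page_cache_ra_unbounded(). The helper name sketch_clamp_to_eof()
 * and the SKETCH_PAGE_SHIFT value (4 KiB pages) are assumptions made for
 * this example and do not exist in the kernel.
 */

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/*
 * Return how many pages, starting at page `index`, may be read without
 * going past the page that holds the last byte of a file of `isize` bytes.
 */
static unsigned long sketch_clamp_to_eof(unsigned long index,
					 unsigned long nr_to_read,
					 int64_t isize)
{
	unsigned long end_index;

	if (isize == 0)
		return 0;		/* empty file: nothing to read */

	end_index = (unsigned long)((isize - 1) >> SKETCH_PAGE_SHIFT);
	if (index > end_index)
		return 0;		/* request starts beyond EOF */
	if (nr_to_read > end_index - index)
		nr_to_read = end_index - index + 1;
	return nr_to_read;
}
```

/*
 * Under these assumptions, a 5-page request at index 8 of a 10-page file
 * (isize = 40960) is trimmed to 2 pages, namely pages 8 and 9.
 */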

/*
 * Chunk the readahead into 2 megabyte units, so that we don't pin too much
 * memory at once.
 */
void force_page_cache_ra(struct readahead_control *ractl,
		unsigned long nr_to_read)
{
	struct address_space *mapping = ractl->mapping;
	struct file_ra_state *ra = ractl->ra;
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	unsigned long max_pages;

	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
		return;

	/*
	 * If the request exceeds the readahead window, allow the read to
	 * be up to the optimal hardware IO size
	 */
	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
	while (nr_to_read) {
		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;

		if (this_chunk > nr_to_read)
			this_chunk = nr_to_read;
		do_page_cache_ra(ractl, this_chunk, 0);

		nr_to_read -= this_chunk;
	}
}
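
/*
 * Illustration only, not part of the kernel build: a user-space sketch of
 * the 2 MiB chunking loop in force_page_cache_ra(). It omits the earlier
 * cap against max(bdi->io_pages, ra->ra_pages) and records chunk lengths
 * instead of issuing I/O. sketch_chunk_readahead() and SKETCH_PAGE_SIZE
 * (4 KiB) are hypothetical names chosen for this example.
 */

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/*
 * Split `nr_to_read` pages into chunks of at most 2 MiB and store each
 * chunk length in `chunks` (capacity `cap`). Returns the number of chunks
 * the loop would issue, even if that exceeds `cap`.
 */
static size_t sketch_chunk_readahead(unsigned long nr_to_read,
				     unsigned long *chunks, size_t cap)
{
	size_t n = 0;

	while (nr_to_read) {
		unsigned long this_chunk = (2 * 1024 * 1024) / SKETCH_PAGE_SIZE;

		if (this_chunk > nr_to_read)
			this_chunk = nr_to_read;
		if (n < cap)
			chunks[n] = this_chunk;	/* stand-in for do_page_cache_ra() */
		n++;
		nr_to_read -= this_chunk;
	}
	return n;
}
```

/*
 * With 4 KiB pages a chunk is 512 pages, so a 1300-page request would be
 * issued as three chunks of 512, 512 and 276 pages.
 */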

/*
 * Set the initial window size: round up to the next power of 2, then
 * multiply by 4 for small sizes, by 2 for medium sizes, and use the
 * maximum for large sizes.
 * For a 128k (32 page) max ra:
 * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
 */
static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
{
	unsigned long newsize = roundup_pow_of_two(size);

	if (newsize <= max / 32)
		newsize = newsize * 4;
	else if (newsize <= max / 4)
		newsize = newsize * 2;
	else
		newsize = max;

	return newsize;
}
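
/*
 * Illustration only, not part of the kernel build: a user-space sketch
 * that mirrors get_init_ra_size() so the table in its comment can be
 * checked by hand. sketch_roundup_pow_of_two() is a hypothetical stand-in
 * for the kernel's roundup_pow_of_two() and is only meant for small,
 * non-zero inputs.
 */

```c
/* Smallest power of two >= n, for small n >= 1. */
static unsigned long sketch_roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Same branch structure as get_init_ra_size(); sizes are in pages. */
static unsigned long sketch_init_ra_size(unsigned long size, unsigned long max)
{
	unsigned long newsize = sketch_roundup_pow_of_two(size);

	if (newsize <= max / 32)
		newsize = newsize * 4;	/* small request */
	else if (newsize <= max / 4)
		newsize = newsize * 2;	/* medium request */
	else
		newsize = max;		/* large request: use the maximum */

	return newsize;
}
```

/*
 * With max = 32 pages (128k at an assumed 4 KiB page size), requests of
 * 1-2 pages give 4 pages (16k), 3-4 give 8 (32k), 5-8 give 16 (64k) and
 * anything larger gives 32 (128k).
 */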
|
|
|
|
|
readahead: on-demand readahead logic
This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. Also it is full integrated with adaptive readahead.
It is designed to be called on demand:
- on a missing page, to do synchronous readahead
- on a lookahead page, to do asynchronous readahead
In this way it eliminated the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.
HEURISTICS
The logic deals with four cases:
- sequential-next
found a consistent readahead window, so push it forward
- random
standalone small read, so read as is
- sequential-first
create a new readahead window for a sequential/oversize request
- lookahead-clueless
hit a lookahead page not associated with the readahead window,
so create a new readahead window and ramp it up
In each case, three parameters are determined:
- readahead index: where the next readahead begins
- readahead size: how much to readahead
- lookahead size: when to do the next readahead (for pipelining)
BEHAVIORS
The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:
- It no longer imposes strict sequential checks.
It might help some interleaved cases, and clustered random reads.
It does introduce risks of a random lookahead hit triggering an
unexpected readahead. But in general it is more likely to do good
than to do evil.
- Interleaved reads are supported in a minimal way.
Their chances of being detected and properly handled are still low.
- Readahead thrashing is better handled.
The current readahead leads to tiny average I/O sizes, because it
never turns back for the thrashed pages. They have to be faulted in
by do_generic_mapping_read() one by one, whereas the on-demand
readahead will redo readahead for them.
OVERHEADS
The new code reduces the overhead of
- excessively calling the readahead routine on small reads
(the current readahead code insists on seeing all requests)
- doing a lot of pointless page-cache lookups for small cached files
(the current readahead only turns itself off after 256 cache hits;
unfortunately most files are < 1MB, so they never get that chance)
That accounts for speedups of
- 0.3% on 1-page sequential reads on a sparse file
- 1.2% on 1-page cache-hot sequential reads
- 3.2% on 256-page cache-hot sequential reads
- 1.3% on cache-hot `tar /lib`
However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That is a 1% overhead for 1-page random reads
on a sparse file.
PERFORMANCE
The basic benchmark setup is
- 2.6.20 kernel with on-demand readahead
- 1MB max readahead size
- 2.9GHz Intel Core 2 CPU
- 2GB memory
- 160G/8M Hitachi SATA II 7200 RPM disk
The benchmarks show that
- it maintains the same performance for trivial sequential/random reads
- sysbench/OLTP performance on MySQL gains up to 8%
- performance on readahead thrashing gains up to 3 times
iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k
                     2.6.20     on-demand    gain
first run
" Initial write "    61437.27   64521.53    +5.0%
" Rewrite "          47893.02   48335.20    +0.9%
" Read "             62111.84   62141.49    +0.0%
" Re-read "          62242.66   62193.17    -0.1%
" Reverse Read "     50031.46   49989.79    -0.1%
" Stride read "       8657.61    8652.81    -0.1%
" Random read "      13914.28   13898.23    -0.1%
" Mixed workload "   19069.27   19033.32    -0.2%
" Random write "     14849.80   14104.38    -5.0%
" Pwrite "           62955.30   65701.57    +4.4%
" Pread "            62209.99   62256.26    +0.1%
second run
" Initial write "    60810.31   66258.69    +9.0%
" Rewrite "          49373.89   57833.66   +17.1%
" Read "             62059.39   62251.28    +0.3%
" Re-read "          62264.32   62256.82    -0.0%
" Reverse Read "     49970.96   50565.72    +1.2%
" Stride read "       8654.81    8638.45    -0.2%
" Random read "      13901.44   13949.91    +0.3%
" Mixed workload "   19041.32   19092.04    +0.3%
" Random write "     14019.99   14161.72    +1.0%
" Pwrite "           64121.67   68224.17    +6.4%
" Pread "            62225.08   62274.28    +0.1%
In summary, writes are unstable, reads are pretty close on average:
access pattern       2.6.20     on-demand    gain
Read                 62085.61   62196.38    +0.2%
Re-read              62253.49   62224.99    -0.0%
Reverse Read         50001.21   50277.75    +0.6%
Stride read           8656.21    8645.63    -0.1%
Random read          13907.86   13924.07    +0.1%
Mixed workload       19055.29   19062.68    +0.0%
Pread                62217.53   62265.27    +0.1%
aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
               2.6.20     on-demand   delta
sequential      92.57s     92.54s     -0.0%
random         311.87s    312.15s     +0.1%
sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
--file-total-size=4G --file-block-size=64K \
--num-threads=001 --max-requests=10000 --max-time=900 run
threads    2.6.20      on-demand   delta
first run
      1    59.1974s    59.2262s    +0.0%
      2    58.0575s    58.2269s    +0.3%
      4    48.0545s    47.1164s    -2.0%
      8    41.0684s    41.2229s    +0.4%
     16    35.8817s    36.4448s    +1.6%
     32    32.6614s    32.8240s    +0.5%
     64    23.7601s    24.1481s    +1.6%
    128    24.3719s    23.8225s    -2.3%
    256    23.2366s    22.0488s    -5.1%
second run
      1    59.6720s    59.5671s    -0.2%
      8    41.5158s    41.9541s    +1.1%
     64    25.0200s    23.9634s    -4.2%
    256    22.5491s    20.9486s    -7.1%
Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:
sum all up 495.046s 491.514s -0.7%
sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
--mysql-socket=/var/run/mysqld/mysqld.sock \
--mysql-user=root --mysql-password=readahead \
--num-threads=064 --max-requests=10000 --max-time=900 run
10000-transactions run
threads    2.6.20    on-demand   gain
      1     62.81      64.56     +2.8%
      2     67.97      70.93     +4.4%
      4     81.81      85.87     +5.0%
      8     94.60      97.89     +3.5%
     16     99.07     104.68     +5.7%
     32     95.93     104.28     +8.7%
     64     96.48     103.68     +7.5%
5000-transactions run
      1     48.21      48.65     +0.9%
      8     68.60      70.19     +2.3%
     64     70.57      74.72     +5.9%
2000-transactions run
      1     37.57      38.04     +1.3%
      2     38.43      38.99     +1.5%
      4     45.39      46.45     +2.3%
      8     51.64      52.36     +1.4%
     16     54.39      55.18     +1.5%
     32     52.13      54.49     +4.5%
     64     54.13      54.61     +0.9%
These are interesting results. Some investigation shows that
- MySQL is accessing the db file non-uniformly: some parts are
hotter than others
- It is mostly doing 4-page random reads, and sometimes doing two
reads in a row; the latter one triggers a 16-page readahead.
- The on-demand readahead leaves many lookahead pages (flagged
PG_readahead) there. Many of them will be hit and trigger
more readahead pages, which might save more seeks.
- Naturally, the readahead windows tend to lie in hot areas,
and the lookahead pages in hot areas are more likely to be hit.
- The higher the overall read density, the greater the possible gain.
That also explains the adaptive readahead tricks for clustered random reads.
readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single" and start one 100KB/s stream every
second, until reaching 200 streams.
             max throughput   min avg I/O size
2.6.20:      5MB/s            16KB
on-demand:   15MB/s           140KB
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 01:48:01 -07:00

/*
 * Get the previous window size, ramp it up, and
 * return it as the new window size.
 */
static unsigned long get_next_ra_size(struct file_ra_state *ra,
				      unsigned long max)
{
	unsigned long cur = ra->size;

	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}
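
To see how the window grows across successive calls to get_next_ra_size(), here is a minimal userspace sketch. `next_ra_size()` mirrors the branches above but takes the current size directly instead of a struct file_ra_state, and `ramp_up()` is a hypothetical helper that records each step; sizes are in pages.

```c
#include <assert.h>
#include <stddef.h>

/* Same branches as get_next_ra_size(), taking the current size directly. */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;		/* far below max: quadruple */
	if (cur <= max / 2)
		return 2 * cur;		/* at most half of max: double */
	return max;			/* otherwise use the maximum */
}

/*
 * Hypothetical helper: record successive window sizes, starting from
 * 'cur', until the window reaches 'max' or 'cap' entries are written.
 * Returns the number of entries written to 'out'.
 */
static size_t ramp_up(unsigned long cur, unsigned long max,
		      unsigned long *out, size_t cap)
{
	size_t n = 0;

	assert(cur > 0 && max > 0);
	while (n < cap) {
		out[n++] = cur;
		if (cur >= max)
			break;
		cur = next_ra_size(cur, max);
	}
	return n;
}
```

Starting from a 4-page window with max = 256 pages, the recorded sizes are 4, 16, 32, 64, 128 and 256: one quadrupling while the window is below max / 16, then doubling until the maximum is reached.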

/*
 * On-demand readahead design.
 *
 * The fields in struct file_ra_state represent the most-recently-executed
 * readahead attempt:
 *
 *                        |<----- async_size ---------|
 *     |------------------- size -------------------->|
 *     |==================#===========================|
 *     ^start             ^page marked with PG_readahead
readahead: on-demand readahead logic
This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. Also it is full integrated with adaptive readahead.
It is designed to be called on demand:
- on a missing page, to do synchronous readahead
- on a lookahead page, to do asynchronous readahead
In this way it eliminated the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.
HEURISTICS
The logic deals with four cases:
- sequential-next
found a consistent readahead window, so push it forward
- random
standalone small read, so read as is
- sequential-first
create a new readahead window for a sequential/oversize request
- lookahead-clueless
hit a lookahead page not associated with the readahead window,
so create a new readahead window and ramp it up
In each case, three parameters are determined:
- readahead index: where the next readahead begins
- readahead size: how much to readahead
- lookahead size: when to do the next readahead (for pipelining)
BEHAVIORS
The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:
- It no longer imposes strict sequential checks.
It might help some interleaved cases, and clustered random reads.
It does introduce risks of a random lookahead hit triggering an
unexpected readahead. But in general it is more likely to do good
than to do evil.
- Interleaved reads are supported in a minimal way.
Their chances of being detected and proper handled are still low.
- Readahead thrashings are better handled.
The current readahead leads to tiny average I/O sizes, because it
never turn back for the thrashed pages. They have to be fault in
by do_generic_mapping_read() one by one. Whereas the on-demand
readahead will redo readahead for them.
OVERHEADS
The new code reduced the overheads of
- excessively calling the readahead routine on small sized reads
(the current readahead code insists on seeing all requests)
- doing a lot of pointless page-cache lookups for small cached files
(the current readahead only turns itself off after 256 cache hits,
but most files are < 1MB, so they never get that chance)
That accounts for speedup of
- 0.3% on 1-page sequential reads on sparse file
- 1.2% on 1-page cache hot sequential reads
- 3.2% on 256-page cache hot sequential reads
- 1.3% on cache hot `tar /lib`
However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That is a 1% overhead for 1-page random reads
on a sparse file.
PERFORMANCE
The basic benchmark setup is
- 2.6.20 kernel with on-demand readahead
- 1MB max readahead size
- 2.9GHz Intel Core 2 CPU
- 2GB memory
- 160G/8M Hitachi SATA II 7200 RPM disk
The benchmarks show that
- it maintains the same performance for trivial sequential/random reads
- sysbench/OLTP performance on MySQL gains up to 8%
- performance on readahead thrashing gains up to 3 times
iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k
                        2.6.20   on-demand     gain
first run
" Initial write "     61437.27    64521.53    +5.0%
" Rewrite "           47893.02    48335.20    +0.9%
" Read "              62111.84    62141.49    +0.0%
" Re-read "           62242.66    62193.17    -0.1%
" Reverse Read "      50031.46    49989.79    -0.1%
" Stride read "        8657.61     8652.81    -0.1%
" Random read "       13914.28    13898.23    -0.1%
" Mixed workload "    19069.27    19033.32    -0.2%
" Random write "      14849.80    14104.38    -5.0%
" Pwrite "            62955.30    65701.57    +4.4%
" Pread "             62209.99    62256.26    +0.1%
second run
" Initial write "     60810.31    66258.69    +9.0%
" Rewrite "           49373.89    57833.66   +17.1%
" Read "              62059.39    62251.28    +0.3%
" Re-read "           62264.32    62256.82    -0.0%
" Reverse Read "      49970.96    50565.72    +1.2%
" Stride read "        8654.81     8638.45    -0.2%
" Random read "       13901.44    13949.91    +0.3%
" Mixed workload "    19041.32    19092.04    +0.3%
" Random write "      14019.99    14161.72    +1.0%
" Pwrite "            64121.67    68224.17    +6.4%
" Pread "             62225.08    62274.28    +0.1%
In summary, writes are unstable, reads are pretty close on average:
access pattern          2.6.20   on-demand     gain
Read                  62085.61    62196.38    +0.2%
Re-read               62253.49    62224.99    -0.0%
Reverse Read          50001.21    50277.75    +0.6%
Stride read            8656.21     8645.63    -0.1%
Random read           13907.86    13924.07    +0.1%
Mixed workload        19055.29    19062.68    +0.0%
Pread                 62217.53    62265.27    +0.1%
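The summary rows appear to be the mean of the two iozone runs, with the gain taken relative to the 2.6.20 figure. A minimal sketch of that arithmetic follows; the helper names `mean2()` and `gain_percent()` are hypothetical and not part of any benchmark tool.

```c
#include <assert.h>

/* Arithmetic mean of two runs. */
static double mean2(double a, double b)
{
	return (a + b) / 2.0;
}

/* Relative change of `after` against `before`, in percent. */
static double gain_percent(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For the "Read" row, `mean2(62111.84, 62059.39)` and `mean2(62141.49, 62251.28)` give the two averages, and `gain_percent()` of those two values comes out just under 0.2%, in line with the table.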
aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
               2.6.20   on-demand    delta
sequential     92.57s      92.54s    -0.0%
random        311.87s     312.15s    +0.1%
sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
--file-total-size=4G --file-block-size=64K \
--num-threads=001 --max-requests=10000 --max-time=900 run
threads      2.6.20   on-demand    delta
first run
      1    59.1974s    59.2262s    +0.0%
      2    58.0575s    58.2269s    +0.3%
      4    48.0545s    47.1164s    -2.0%
      8    41.0684s    41.2229s    +0.4%
     16    35.8817s    36.4448s    +1.6%
     32    32.6614s    32.8240s    +0.5%
     64    23.7601s    24.1481s    +1.6%
    128    24.3719s    23.8225s    -2.3%
    256    23.2366s    22.0488s    -5.1%
second run
      1    59.6720s    59.5671s    -0.2%
      8    41.5158s    41.9541s    +1.1%
     64    25.0200s    23.9634s    -4.2%
    256    22.5491s    20.9486s    -7.1%
Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:
sum all up 495.046s 491.514s -0.7%
sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
--mysql-socket=/var/run/mysqld/mysqld.sock \
--mysql-user=root --mysql-password=readahead \
--num-threads=064 --max-requests=10000 --max-time=900 run
10000-transactions run
threads      2.6.20   on-demand     gain
      1       62.81       64.56    +2.8%
      2       67.97       70.93    +4.4%
      4       81.81       85.87    +5.0%
      8       94.60       97.89    +3.5%
     16       99.07      104.68    +5.7%
     32       95.93      104.28    +8.7%
     64       96.48      103.68    +7.5%
5000-transactions run
      1       48.21       48.65    +0.9%
      8       68.60       70.19    +2.3%
     64       70.57       74.72    +5.9%
2000-transactions run
      1       37.57       38.04    +1.3%
      2       38.43       38.99    +1.5%
      4       45.39       46.45    +2.3%
      8       51.64       52.36    +1.4%
     16       54.39       55.18    +1.5%
     32       52.13       54.49    +4.5%
     64       54.13       54.61    +0.9%
These are interesting results. Some investigation shows that
- MySQL is accessing the db file non-uniformly: some parts are
hotter than others
- It is mostly doing 4-page random reads, and sometimes doing two
reads in a row, the latter of which triggers a 16-page readahead.
- The on-demand readahead leaves many lookahead pages (flagged
PG_readahead) there. Many of them will be hit and trigger
more readahead pages, which might save more seeks.
- Naturally, the readahead windows tend to lie in hot areas,
and the lookahead pages in hot areas are more likely to be hit.
- The higher the overall read density, the greater the possible gain.
That also explains the adaptive readahead tricks for clustered random reads.
readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single", and start a new 100KB/s stream
every second, until reaching 200 streams.
              max throughput    min avg I/O size
2.6.20:                5MB/s                16KB
on-demand:            15MB/s               140KB
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 01:48:01 -07:00
|
|
|
*
|
|
|
|
* To overlap application thinking time and disk I/O time, we do
|
|
|
|
* `readahead pipelining': Do not wait until the application has consumed all
|
|
|
|
* readahead pages and stalled on the missing page at readahead_index;
|
2007-07-19 01:48:08 -07:00
|
|
|
* Instead, submit an asynchronous readahead I/O as soon as there are
|
|
|
|
* only async_size pages left in the readahead window. Normally async_size
|
|
|
|
* will be equal to size, for maximum pipelining.
|
2007-07-19 01:48:01 -07:00
|
|
|
*
|
|
|
|
* In interleaved sequential reads, concurrent streams on the same fd can
|
|
|
|
* invalidate each other's readahead state. So we flag the new readahead
|
2007-07-19 01:48:08 -07:00
|
|
|
* page at (start+size-async_size) with PG_readahead, and use it as readahead
|
2007-07-19 01:48:01 -07:00
|
|
|
* indicator. The flag won't be set on already cached pages, to avoid the
|
|
|
|
* readahead-for-nothing fuss, saving pointless page cache lookups.
|
|
|
|
*
|
2007-10-16 01:24:33 -07:00
|
|
|
* prev_pos tracks the last visited byte in the _previous_ read request.
|
2007-07-19 01:48:01 -07:00
|
|
|
* It should be maintained by the caller, and will be used for detecting
|
|
|
|
* small random reads. Note that the readahead algorithm checks loosely
|
|
|
|
* for sequential patterns. Hence interleaved reads might be served as
|
|
|
|
* sequential ones.
|
|
|
|
*
|
|
|
|
* There is a special case: if the first page which the application tries to
|
|
|
|
* read happens to be the first page of the file, it is assumed that a linear
|
|
|
|
* read is about to happen and the window is immediately set to the initial size
|
|
|
|
* based on I/O request size and the max_readahead.
|
|
|
|
*
|
|
|
|
* The code ramps up the readahead size aggressively at first, but slows down as
|
|
|
|
* it approaches max_readahead.
|
|
|
|
*/
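The comment above describes two calculations that can be illustrated in isolation: where the readahead marker lands inside a window, and how the window size grows toward its cap. The following is a user-space sketch under stated assumptions, not the kernel's implementation: `ra_marker_index()` and `next_ra_size()` are hypothetical helper names, and the 4x/2x growth thresholds are chosen only to show "aggressive at first, slower near the maximum".

```c
#include <assert.h>

/* Index of the page that carries the readahead flag (hypothetical helper). */
static unsigned long ra_marker_index(unsigned long start, unsigned long size,
				     unsigned long async_size)
{
	assert(async_size <= size);
	return start + size - async_size;
}

/*
 * Next window size: grow fast while small, slower near the cap.
 * The 4x/2x thresholds are assumptions chosen for illustration.
 */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}
```

When async_size equals size, the marker falls on the first page of the window, so the next asynchronous readahead is submitted as soon as the application touches the window at all. With a cap of 256 pages, this sketch grows a 4-page window to 16, then 32, and so on by doubling until it reaches the cap.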
|
|
|
|
|
2020-02-05 11:27:01 -05:00
|
|
|
static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
|
|
|
|
pgoff_t mark, unsigned int order, gfp_t gfp)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct folio *folio = filemap_alloc_folio(gfp, order);
|
|
|
|
|
|
|
|
if (!folio)
|
|
|
|
return -ENOMEM;
|
2024-01-04 09:58:39 +01:00
|
|
|
mark = round_down(mark, 1UL << order);
|
2022-04-27 17:01:28 -04:00
|
|
|
if (index == mark)
|
2020-02-05 11:27:01 -05:00
|
|
|
folio_set_readahead(folio);
|
|
|
|
err = filemap_add_folio(ractl->mapping, folio, index, gfp);
|
2022-09-15 10:41:56 +01:00
|
|
|
if (err) {
|
2020-02-05 11:27:01 -05:00
|
|
|
folio_put(folio);
|
2022-09-15 10:41:56 +01:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
ractl->_nr_pages += 1UL << order;
|
|
|
|
ractl->_workingset |= folio_test_workingset(folio);
|
|
|
|
return 0;
|
2020-02-05 11:27:01 -05:00
|
|
|
}
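ra_alloc_folio() rounds the marker index down to the folio size before comparing it with the folio's index, so a marker that falls in the middle of a large folio is attributed to the folio that contains it. A small user-space sketch of that rounding follows; `round_down_pow2()` and `folio_takes_mark()` are hypothetical stand-ins for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Round x down to a multiple of the power-of-two value `pow2`. */
static unsigned long round_down_pow2(unsigned long x, unsigned long pow2)
{
	assert(pow2 != 0 && (pow2 & (pow2 - 1)) == 0);
	return x & ~(pow2 - 1);
}

/* True if a folio of 2^order pages starting at `index` would get the flag. */
static bool folio_takes_mark(unsigned long index, unsigned long mark,
			     unsigned int order)
{
	return index == round_down_pow2(mark, 1UL << order);
}
```

For example, with order 2 (four pages per folio) and a marker at page 118, the marker rounds down to 116, so the folio starting at page 116 takes the flag and the folio starting at page 112 does not.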
|
|
|
|
|
2021-07-24 23:26:14 -04:00
|
|
|
void page_cache_ra_order(struct readahead_control *ractl,
|
2020-02-05 11:27:01 -05:00
|
|
|
struct file_ra_state *ra, unsigned int new_order)
|
|
|
|
{
|
|
|
|
struct address_space *mapping = ractl->mapping;
|
2024-06-25 12:18:53 +02:00
|
|
|
pgoff_t start = readahead_index(ractl);
|
|
|
|
pgoff_t index = start;
|
2024-08-22 15:50:11 +02:00
|
|
|
unsigned int min_order = mapping_min_folio_order(mapping);
|
2020-02-05 11:27:01 -05:00
|
|
|
pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
|
|
|
|
pgoff_t mark = index + ra->size - ra->async_size;
|
2024-04-26 19:29:38 +08:00
|
|
|
unsigned int nofs;
|
2020-02-05 11:27:01 -05:00
|
|
|
int err = 0;
|
|
|
|
gfp_t gfp = readahead_gfp_mask(mapping);
|
2024-08-22 15:50:11 +02:00
|
|
|
unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
|
2020-02-05 11:27:01 -05:00
|
|
|
|
2024-08-22 15:50:11 +02:00
|
|
|
/*
|
|
|
|
* Fall back when size < min_nrpages, as each folio should be
|
|
|
|
* at least min_nrpages anyway.
|
|
|
|
*/
|
|
|
|
if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
|
2020-02-05 11:27:01 -05:00
|
|
|
goto fallback;
|
|
|
|
|
|
|
|
limit = min(limit, index + ra->size - 1);
|
|
|
|
|
2024-08-22 15:50:09 +02:00
|
|
|
if (new_order < mapping_max_folio_order(mapping))
|
2020-02-05 11:27:01 -05:00
|
|
|
new_order += 2;
|
2024-06-27 10:39:50 +10:00
|
|
|
|
2024-08-22 15:50:09 +02:00
|
|
|
new_order = min(mapping_max_folio_order(mapping), new_order);
|
2024-06-27 10:39:50 +10:00
|
|
|
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
|
2024-08-22 15:50:11 +02:00
|
|
|
new_order = max(new_order, min_order);
|
2020-02-05 11:27:01 -05:00
|
|
|
|
2024-04-26 19:29:38 +08:00
|
|
|
/* See comment in page_cache_ra_unbounded() */
|
|
|
|
nofs = memalloc_nofs_save();
|
2022-06-20 19:05:36 +10:00
|
|
|
filemap_invalidate_lock_shared(mapping);
|
2024-08-22 15:50:11 +02:00
|
|
|
/*
|
|
|
|
* If the new_order is greater than min_order and index is
|
|
|
|
* already aligned to new_order, then this will be noop as index
|
|
|
|
* aligned to new_order should also be aligned to min_order.
|
|
|
|
*/
|
|
|
|
ractl->_index = mapping_align_index(mapping, index);
|
|
|
|
index = readahead_index(ractl);
|
|
|
|
|
2020-02-05 11:27:01 -05:00
|
|
|
while (index <= limit) {
|
|
|
|
unsigned int order = new_order;
|
|
|
|
|
|
|
|
/* Align with smaller pages if needed */
|
2023-12-01 16:10:45 +00:00
|
|
|
if (index & ((1UL << order) - 1))
|
2020-02-05 11:27:01 -05:00
|
|
|
order = __ffs(index);
|
|
|
|
/* Don't allocate pages past EOF */
|
2024-08-22 15:50:11 +02:00
|
|
|
while (order > min_order && index + (1UL << order) - 1 > limit)
|
2023-12-01 16:10:45 +00:00
|
|
|
order--;
|
2020-02-05 11:27:01 -05:00
|
|
|
err = ra_alloc_folio(ractl, index, mark, order, gfp);
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
index += 1UL << order;
|
|
|
|
}
|
|
|
|
|
2022-03-31 05:35:55 -07:00
|
|
|
read_pages(ractl);
|
2022-06-20 19:05:36 +10:00
|
|
|
filemap_invalidate_unlock_shared(mapping);
|
2024-04-26 19:29:38 +08:00
|
|
|
memalloc_nofs_restore(nofs);
|
2020-02-05 11:27:01 -05:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If there were already pages in the page cache, then we may have
|
|
|
|
* left some gaps. Let the regular readahead code take care of this
|
|
|
|
* situation.
|
|
|
|
*/
|
|
|
|
if (!err)
|
|
|
|
return;
|
|
|
|
fallback:
|
2024-06-25 12:18:53 +02:00
|
|
|
do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size);
|
2020-02-05 11:27:01 -05:00
|
|
|
}
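The per-folio order selection in the loop above can be modelled in userspace. This is a sketch under stated assumptions, not the kernel code: it skips allocation, locking, and the fallback path, and the names `lowest_set_bit` (a stand-in for `__ffs()`) and `plan_folio_orders` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the lowest set bit; x must be non-zero (stand-in for __ffs()). */
static unsigned int lowest_set_bit(unsigned long x)
{
	unsigned int bit = 0;

	while (!(x & 1UL)) {
		x >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Hypothetical userspace model: record the order chosen for each folio
 * covering [index, limit] and return how many folios were planned.
 */
static size_t plan_folio_orders(unsigned long index, unsigned long limit,
				unsigned int new_order, unsigned int min_order,
				unsigned int *orders, size_t max_orders)
{
	size_t n = 0;

	while (index <= limit && n < max_orders) {
		unsigned int order = new_order;

		/* Misaligned start: use the natural alignment of index. */
		if (index & ((1UL << order) - 1))
			order = lowest_set_bit(index);
		/* Shrink the folio until it no longer crosses the limit. */
		while (order > min_order && index + (1UL << order) - 1 > limit)
			order--;
		orders[n++] = order;
		index += 1UL << order;
	}
	return n;
}
```

Starting at index 6 with a limit of 20 and a target order of 3, the model plans folios of order 1, 3, 2, and 0: a small folio to reach alignment, one full-size folio, then progressively smaller folios so that nothing extends past the limit.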
|
|
|
|
|
2024-06-25 12:18:58 +02:00
|
|
|
static unsigned long ractl_max_pages(struct readahead_control *ractl,
|
|
|
|
unsigned long req_size)
|
readahead: on-demand readahead logic
This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. It is also fully integrated with adaptive readahead.
It is designed to be called on demand:
- on a missing page, to do synchronous readahead
- on a lookahead page, to do asynchronous readahead
In this way it eliminates the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.
HEURISTICS
The logic deals with four cases:
- sequential-next
found a consistent readahead window, so push it forward
- random
standalone small read, so read as is
- sequential-first
create a new readahead window for a sequential/oversize request
- lookahead-clueless
hit a lookahead page not associated with the readahead window,
so create a new readahead window and ramp it up
In each case, three parameters are determined:
- readahead index: where the next readahead begins
- readahead size: how much to readahead
- lookahead size: when to do the next readahead (for pipelining)
BEHAVIORS
The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:
- It no longer imposes strict sequential checks.
It might help some interleaved cases, and clustered random reads.
It does introduce risks of a random lookahead hit triggering an
unexpected readahead. But in general it is more likely to do good
than to do evil.
- Interleaved reads are supported in a minimal way.
Their chances of being detected and properly handled are still low.
- Readahead thrashings are better handled.
The current readahead leads to tiny average I/O sizes, because it
never turns back for the thrashed pages. They have to be faulted in
by do_generic_mapping_read() one by one, whereas the on-demand
readahead will redo readahead for them.
OVERHEADS
The new code reduced the overheads of
- excessively calling the readahead routine on small sized reads
(the current readahead code insists on seeing all requests)
- doing a lot of pointless page-cache lookups for small cached files
(the current readahead only turns itself off after 256 cache hits,
unfortunately most files are < 1MB, so never see that chance)
That accounts for speedup of
- 0.3% on 1-page sequential reads on sparse file
- 1.2% on 1-page cache hot sequential reads
- 3.2% on 256-page cache hot sequential reads
- 1.3% on cache hot `tar /lib`
However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That's 1% overheads for 1-page random reads on
sparse file.
PERFORMANCE
The basic benchmark setup is
- 2.6.20 kernel with on-demand readahead
- 1MB max readahead size
- 2.9GHz Intel Core 2 CPU
- 2GB memory
- 160G/8M Hitachi SATA II 7200 RPM disk
The benchmarks show that
- it maintains the same performance for trivial sequential/random reads
- sysbench/OLTP performance on MySQL gains up to 8%
- performance on readahead thrashing gains up to 3 times
iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k
2.6.20 on-demand gain
first run
" Initial write " 61437.27 64521.53 +5.0%
" Rewrite " 47893.02 48335.20 +0.9%
" Read " 62111.84 62141.49 +0.0%
" Re-read " 62242.66 62193.17 -0.1%
" Reverse Read " 50031.46 49989.79 -0.1%
" Stride read " 8657.61 8652.81 -0.1%
" Random read " 13914.28 13898.23 -0.1%
" Mixed workload " 19069.27 19033.32 -0.2%
" Random write " 14849.80 14104.38 -5.0%
" Pwrite " 62955.30 65701.57 +4.4%
" Pread " 62209.99 62256.26 +0.1%
second run
" Initial write " 60810.31 66258.69 +9.0%
" Rewrite " 49373.89 57833.66 +17.1%
" Read " 62059.39 62251.28 +0.3%
" Re-read " 62264.32 62256.82 -0.0%
" Reverse Read " 49970.96 50565.72 +1.2%
" Stride read " 8654.81 8638.45 -0.2%
" Random read " 13901.44 13949.91 +0.3%
" Mixed workload " 19041.32 19092.04 +0.3%
" Random write " 14019.99 14161.72 +1.0%
" Pwrite " 64121.67 68224.17 +6.4%
" Pread " 62225.08 62274.28 +0.1%
In summary, writes are unstable, reads are pretty close on average:
access pattern 2.6.20 on-demand gain
Read 62085.61 62196.38 +0.2%
Re-read 62253.49 62224.99 -0.0%
Reverse Read 50001.21 50277.75 +0.6%
Stride read 8656.21 8645.63 -0.1%
Random read 13907.86 13924.07 +0.1%
Mixed workload 19055.29 19062.68 +0.0%
Pread 62217.53 62265.27 +0.1%
aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
2.6.20 on-demand delta
sequential 92.57s 92.54s -0.0%
random 311.87s 312.15s +0.1%
sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
--file-total-size=4G --file-block-size=64K \
--num-threads=001 --max-requests=10000 --max-time=900 run
threads 2.6.20 on-demand delta
first run
1 59.1974s 59.2262s +0.0%
2 58.0575s 58.2269s +0.3%
4 48.0545s 47.1164s -2.0%
8 41.0684s 41.2229s +0.4%
16 35.8817s 36.4448s +1.6%
32 32.6614s 32.8240s +0.5%
64 23.7601s 24.1481s +1.6%
128 24.3719s 23.8225s -2.3%
256 23.2366s 22.0488s -5.1%
second run
1 59.6720s 59.5671s -0.2%
8 41.5158s 41.9541s +1.1%
64 25.0200s 23.9634s -4.2%
256 22.5491s 20.9486s -7.1%
Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:
sum all up 495.046s 491.514s -0.7%
sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
--mysql-socket=/var/run/mysqld/mysqld.sock \
--mysql-user=root --mysql-password=readahead \
--num-threads=064 --max-requests=10000 --max-time=900 run
10000-transactions run
threads 2.6.20 on-demand gain
1 62.81 64.56 +2.8%
2 67.97 70.93 +4.4%
4 81.81 85.87 +5.0%
8 94.60 97.89 +3.5%
16 99.07 104.68 +5.7%
32 95.93 104.28 +8.7%
64 96.48 103.68 +7.5%
5000-transactions run
1 48.21 48.65 +0.9%
8 68.60 70.19 +2.3%
64 70.57 74.72 +5.9%
2000-transactions run
1 37.57 38.04 +1.3%
2 38.43 38.99 +1.5%
4 45.39 46.45 +2.3%
8 51.64 52.36 +1.4%
16 54.39 55.18 +1.5%
32 52.13 54.49 +4.5%
64 54.13 54.61 +0.9%
These are interesting results. Some investigation shows that
- MySQL is accessing the db file non-uniformly: some parts are
more hot than others
- It is mostly doing 4-page random reads, and sometimes doing two
reads in a row, the latter one triggers a 16-page readahead.
- The on-demand readahead leaves many lookahead pages (flagged
PG_readahead) there. Many of them will be hit, and trigger
more readahead pages. Which might save more seeks.
- Naturally, the readahead windows tend to lie in hot areas,
and the lookahead pages in hot areas are more likely to be hit.
- The more overall read density, the more possible gain.
That also explains the adaptive readahead tricks for clustered random reads.
readahead thrashing: 3 times better
===================================
We boot kernel with "mem=128m single", and start a 100KB/s stream on every
second, until reaching 200 streams.
max throughput min avg I/O size
2.6.20: 5MB/s 16KB
on-demand: 15MB/s 140KB
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 01:48:01 -07:00
|
|
|
{
|
2020-10-15 20:06:21 -07:00
|
|
|
struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
|
2024-06-25 12:18:58 +02:00
|
|
|
unsigned long max_pages = ractl->ra->ra_pages;
|
2009-06-16 15:31:33 -07:00
|
|
|
|
mm: don't cap request size based on read-ahead setting
We ran into a funky issue, where someone doing 256K buffered reads saw
128K requests at the device level. Turns out it is read-ahead capping
the request size, since we use 128K as the default setting. This
doesn't make a lot of sense - if someone is issuing 256K reads, they
should see 256K reads, regardless of the read-ahead setting, if the
underlying device can support a 256K read in a single command.
This patch introduces a bdi hint, io_pages. This is the soft max IO
size for the lower level, I've hooked it up to the bdev settings here.
Read-ahead is modified to issue the maximum of the user request size,
and the read-ahead max size, but capped to the max request size on the
device side. The latter is done to avoid reading ahead too much, if the
application asks for a huge read. With this patch, the kernel behaves
like the application expects.
Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 16:43:26 -08:00
|
|
|
/*
|
|
|
|
* If the request exceeds the readahead window, allow the read to
|
|
|
|
* be up to the optimal hardware IO size
|
|
|
|
*/
|
|
|
|
if (req_size > max_pages && bdi->io_pages > max_pages)
|
|
|
|
max_pages = min(req_size, bdi->io_pages);
|
2024-06-25 12:18:58 +02:00
|
|
|
return max_pages;
|
|
|
|
}
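The cap computed by ractl_max_pages() can be exercised outside the kernel with a small model. This is a minimal sketch that takes the readahead window (`ra_pages`) and the device hint (`io_pages`) as plain arguments; the names `min_ul` and `model_max_pages` are hypothetical.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the cap: all sizes are in pages and are passed
 * in directly instead of being read from kernel structures.
 */
static unsigned long model_max_pages(unsigned long ra_pages,
				     unsigned long io_pages,
				     unsigned long req_size)
{
	unsigned long max_pages = ra_pages;

	/* Only grow past the readahead window if the device allows it. */
	if (req_size > max_pages && io_pages > max_pages)
		max_pages = min_ul(req_size, io_pages);
	return max_pages;
}
```

Assuming 4 KiB pages, a 128 KiB window is 32 pages and a 256 KiB read is 64 pages. With a device hint of 320 pages the read is allowed to go out as 64 pages, which matches the case described in the commit message above.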
|
2016-12-12 16:43:26 -08:00
|
|
|
|
2024-06-25 12:18:58 +02:00
|
|
|
void page_cache_sync_ra(struct readahead_control *ractl,
|
|
|
|
unsigned long req_count)
|
|
|
|
{
|
|
|
|
pgoff_t index = readahead_index(ractl);
|
|
|
|
bool do_forced_ra = ractl->file && (ractl->file->f_mode & FMODE_RANDOM);
|
|
|
|
struct file_ra_state *ra = ractl->ra;
|
2024-06-25 12:18:59 +02:00
|
|
|
unsigned long max_pages, contig_count;
|
|
|
|
pgoff_t prev_index, miss;
|
2007-07-19 01:48:01 -07:00
|
|
|
|
2007-10-16 01:24:34 -07:00
|
|
|
/*
|
2024-06-25 12:18:58 +02:00
|
|
|
* Even if readahead is disabled, issue this request as readahead
|
|
|
|
* as we'll need it to satisfy the requested range. The forced
|
|
|
|
* readahead will do the right thing and limit the read to just the
|
|
|
|
* requested range, which we'll set to 1 page for this case.
|
2007-10-16 01:24:34 -07:00
|
|
|
*/
|
2024-06-25 12:18:58 +02:00
|
|
|
if (!ra->ra_pages || blk_cgroup_congested()) {
|
|
|
|
if (!ractl->file)
|
2020-06-01 21:46:10 -07:00
|
|
|
return;
|
2024-06-25 12:18:58 +02:00
|
|
|
req_count = 1;
|
|
|
|
do_forced_ra = true;
|
|
|
|
}
|
2007-10-16 01:24:34 -07:00
|
|
|
|
2024-06-25 12:18:58 +02:00
|
|
|
/* be dumb */
|
|
|
|
if (do_forced_ra) {
|
|
|
|
force_page_cache_ra(ractl, req_count);
|
|
|
|
return;
|
2007-10-16 01:24:34 -07:00
|
|
|
}
|
|
|
|
|
2024-06-25 12:18:58 +02:00
|
|
|
max_pages = ractl_max_pages(ractl, req_count);
|
2024-06-25 12:19:00 +02:00
|
|
|
prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT;
|
2007-07-19 01:48:01 -07:00
|
|
|
/*
|
2024-06-25 12:19:00 +02:00
|
|
|
* A start of file, oversized read, or sequential cache miss:
|
2020-06-01 21:46:29 -07:00
|
|
|
* trivial case: (index - prev_index) == 1
|
|
|
|
* unaligned reads: (index - prev_index) == 0
|
2009-06-16 15:31:33 -07:00
|
|
|
*/
|
2024-06-25 12:19:00 +02:00
|
|
|
if (!index || req_count > max_pages || index - prev_index <= 1UL) {
|
|
|
|
ra->start = index;
|
|
|
|
ra->size = get_init_ra_size(req_count, max_pages);
|
|
|
|
ra->async_size = ra->size > req_count ? ra->size - req_count :
|
|
|
|
ra->size >> 1;
|
|
|
|
goto readit;
|
|
|
|
}
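The split of the initial window into a synchronous and an asynchronous part can be sketched as follows. The sketch takes the window size as an input rather than reproducing get_init_ra_size(), and it reuses the mark formula from page_cache_ra_order() (`index + size - async_size`). The struct `ra_window` and both helper names are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for the three fields of struct file_ra_state used here. */
struct ra_window {
	unsigned long start;
	unsigned long size;
	unsigned long async_size;
};

/* Build the initial window for a request of req_count pages at index. */
static struct ra_window init_window(unsigned long index, unsigned long size,
				    unsigned long req_count)
{
	struct ra_window w = { .start = index, .size = size };

	/* Everything beyond the request is async; otherwise use half the window. */
	w.async_size = size > req_count ? size - req_count : size >> 1;
	return w;
}

/* Page index at which the next readahead would be triggered. */
static unsigned long window_mark(const struct ra_window *w)
{
	return w->start + w->size - w->async_size;
}
```

For a 4-page request at index 100 with a 16-page window, 12 pages are asynchronous and the mark sits at index 104, the first page past the request. When the request is as large as the window, half of the window is treated as asynchronous instead.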
|
2009-06-16 15:31:33 -07:00
|
|
|
|
readahead: introduce context readahead algorithm
Introduce page cache context based readahead algorithm.
This is to better support concurrent read streams in general.
RATIONALE
---------
The current readahead algorithm detects interleaved reads in a _passive_ way.
Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
By checking for (offset == prev_offset + 1), it will discover the sequentialness
between 3,4 and between 1004,1005, and start doing sequential readahead for the
individual streams since page 4 and page 1005.
The context readahead algorithm guarantees to discover the sequentialness no
matter how the streams are interleaved. For the above example, it will start
sequential readahead since page 2 and 1002.
The trick is to poke for page @offset-1 in the page cache when it has no other
clues on the sequentialness of request @offset: if the current request belongs
to a sequential stream, that stream must have accessed page @offset-1 recently,
and the page will still be cached now. So if page @offset-1 is there, we can
take request @offset as a sequential access.
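The probing trick described above can be sketched with a toy page cache. This is only an illustration under simplifying assumptions: the cache is a plain array of booleans rather than an XArray, and `cached_run_before` and `looks_sequential` are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: cached[i] says whether page i is in the cache.
 * Count the consecutive cached pages immediately before index,
 * scanning at most max_scan pages back.
 */
static size_t cached_run_before(const bool *cached, size_t index,
				size_t max_scan)
{
	size_t run = 0;

	while (run < max_scan && run < index && cached[index - 1 - run])
		run++;
	return run;
}

/* A request at index looks sequential if page index - 1 is still cached. */
static bool looks_sequential(const bool *cached, size_t index,
			     size_t max_scan)
{
	return cached_run_before(cached, index, max_scan) > 0;
}
```

If two interleaved streams have left pages 1, 2, and 50 in the cache, requests for pages 3 and 51 are both recognised as sequential, while a request for page 20 is not. The length of the cached run gives a rough measure of how much history the stream has left behind.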
BENEFICIARIES
-------------
- strictly interleaved reads i.e. 1,1001,2,1002,3,1003,...
the current readahead will take them as silly random reads;
the context readahead will take them as two sequential streams.
- cooperative IO processes i.e. NFS and SCST
They create a thread pool, farming off (sequential) IO requests to different
threads which will be performing interleaved IO.
It was not easy (or possible) to reliably tell from file->f_ra all those
cooperative processes working on the same sequential stream, since they will
have different file->f_ra instances. And NFSD's file->f_ra is particularly
unusable, since their file objects are dynamically created for each request.
The nfsd does have code trying to restore the f_ra bits, but not satisfactory.
The new scheme is to detect the sequential pattern via looking up the page
cache, which provides one single and consistent view of the pages recently
accessed. That makes sequential detection for cooperative processes possible.
USER REPORT
-----------
Vladislav recommends the addition of context readahead as a result of his SCST
benchmarks. It leads to 6%~40% performance gains in various cases and achieves
equal performance in others. http://lkml.org/lkml/2009/3/19/239
OVERHEADS
---------
In theory, it introduces one extra page cache lookup per random read. However
the below benchmark shows context readahead to be slightly faster, wondering..
Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
each block size. The average throughputs are:
original ra context ra gain
4K random reads: 65.561MB/s 65.648MB/s +0.1%
16K random reads: 124.767MB/s 124.951MB/s +0.1%
64K random reads: 162.123MB/s 162.278MB/s +0.1%
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Vladislav Bolkhovitin <vst@vlnb.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
	/*
	 * Query the page cache and look for the traces(cached history pages)
	 * that a sequential stream would leave behind.
	 */
	rcu_read_lock();
	miss = page_cache_prev_miss(ractl->mapping, index - 1, max_pages);
	rcu_read_unlock();
	contig_count = index - miss - 1;
	/*
	 * Standalone, small random read. Read as is, and do not pollute the
	 * readahead state.
	 */
	if (contig_count <= req_count) {
		do_page_cache_ra(ractl, req_count, 0);
		return;
	}
	/*
	 * File cached from the beginning:
	 * it is a strong indication of long-run stream (or whole-file-read)
	 */
	if (miss == ULONG_MAX)
		contig_count *= 2;
	ra->start = index;
	ra->size = min(contig_count + req_count, max_pages);
	ra->async_size = 1;
readit:
	ractl->_index = ra->start;
	page_cache_ra_order(ractl, ra, 0);
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);

void page_cache_async_ra(struct readahead_control *ractl,
		struct folio *folio, unsigned long req_count)
{
	unsigned long max_pages;
	struct file_ra_state *ra = ractl->ra;
	pgoff_t index = readahead_index(ractl);
	pgoff_t expected, start;
	unsigned int order = folio_order(folio);

	/* no readahead */
	if (!ra->ra_pages)
		return;

	/*
	 * Same bit is used for PG_readahead and PG_reclaim.
	 */
	if (folio_test_writeback(folio))
		return;

	folio_clear_readahead(folio);

	if (blk_cgroup_congested())
		return;

	max_pages = ractl_max_pages(ractl, req_count);
	/*
	 * It's the expected callback index, assume sequential access.
	 * Ramp up sizes, and push forward the readahead window.
	 */
	expected = round_down(ra->start + ra->size - ra->async_size,
			1UL << order);
	if (index == expected) {
		ra->start += ra->size;
		ra->size = get_next_ra_size(ra, max_pages);
		ra->async_size = ra->size;
		goto readit;
	}

	/*
	 * Hit a marked folio without valid readahead state.
	 * E.g. interleaved reads.
	 * Query the pagecache for async_size, which normally equals to
	 * readahead size. Ramp it up and use it as the new readahead size.
	 */
	rcu_read_lock();
	start = page_cache_next_miss(ractl->mapping, index + 1, max_pages);
	rcu_read_unlock();

	if (!start || start - index > max_pages)
		return;

	ra->start = start;
	ra->size = start - index;	/* old async_size */
	ra->size += req_count;
	ra->size = get_next_ra_size(ra, max_pages);
	ra->async_size = ra->size;
readit:
	ractl->_index = ra->start;
	page_cache_ra_order(ractl, ra, order);

readahead: on-demand readahead logic

This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. It is also fully integrated with adaptive readahead.

It is designed to be called on demand:
  - on a missing page, to do synchronous readahead
  - on a lookahead page, to do asynchronous readahead

In this way it eliminates the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.

HEURISTICS

The logic deals with four cases:

  - sequential-next
      found a consistent readahead window, so push it forward
  - random
      standalone small read, so read as is
  - sequential-first
      create a new readahead window for a sequential/oversize request
  - lookahead-clueless
      hit a lookahead page not associated with the readahead window,
      so create a new readahead window and ramp it up

In each case, three parameters are determined:

  - readahead index: where the next readahead begins
  - readahead size:  how much to readahead
  - lookahead size:  when to do the next readahead (for pipelining)
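
The four cases and the three parameters can be sketched as a tiny user-space state machine. Everything here is an assumption made for illustration: `struct toy_ra`, `struct toy_req` and `toy_ondemand()` are hypothetical names, the sizing policies (start at 4x the request, then double, capped at `max`) are placeholders, and the caller supplies `next_miss` instead of querying a real page cache. The order of the checks is one plausible arrangement, not a copy of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified window state; not the kernel's struct. */
struct toy_ra {
	unsigned long start;		/* readahead index */
	unsigned long size;		/* readahead size, in pages */
	unsigned long async_size;	/* lookahead size, in pages */
};

/* One read request as seen by the toy heuristic. */
struct toy_req {
	unsigned long index;		/* first page requested */
	unsigned long count;		/* pages requested */
	unsigned long prev_index;	/* last page of the previous read */
	bool hit_marker;		/* request landed on a lookahead page */
	unsigned long next_miss;	/* first uncached page after index */
};

enum toy_case { SEQ_NEXT, RANDOM, SEQ_FIRST, LOOKAHEAD_CLUELESS };

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Assumed policies: start at 4x the request, then double, capped at max. */
static unsigned long toy_init_size(unsigned long count, unsigned long max)
{
	return toy_min(4 * count, max);
}

static unsigned long toy_next_size(unsigned long cur, unsigned long max)
{
	return toy_min(2 * cur, max);
}

static enum toy_case toy_ondemand(struct toy_ra *ra, const struct toy_req *rq,
				  unsigned long max)
{
	assert(max > 0 && rq->count > 0);

	/* sequential-next: the request is where the window expects it. */
	if (ra->size &&
	    (rq->index == ra->start + ra->size - ra->async_size ||
	     rq->index == ra->start + ra->size)) {
		ra->start += ra->size;
		ra->size = toy_next_size(ra->size, max);
		ra->async_size = ra->size;
		return SEQ_NEXT;
	}

	/* lookahead-clueless: rebuild the window from the cached run. */
	if (rq->hit_marker) {
		ra->start = rq->next_miss;
		ra->size = toy_next_size(rq->next_miss - rq->index + rq->count,
					 max);
		ra->async_size = ra->size;
		return LOOKAHEAD_CLUELESS;
	}

	/* sequential-first: start of file, oversize, or adjacent read. */
	if (!rq->index || rq->count > max || rq->index - rq->prev_index <= 1) {
		ra->start = rq->index;
		ra->size = toy_init_size(rq->count, max);
		ra->async_size = ra->size > rq->count ?
				 ra->size - rq->count : ra->size;
		return SEQ_FIRST;
	}

	/* random: read as is and leave the window state untouched. */
	return RANDOM;
}
```

In this model the lookahead marker sits at `start + size - async_size`, so a larger lookahead size places the marker earlier in the window and the next readahead is issued sooner, which is the pipelining role the third parameter plays.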

BEHAVIORS

The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:

  - It no longer imposes strict sequential checks.
    It might help some interleaved cases, and clustered random reads.
    It does introduce the risk of a random lookahead hit triggering an
    unexpected readahead. But in general it is more likely to do good
    than to do evil.

  - Interleaved reads are supported in a minimal way.
    Their chances of being detected and properly handled are still low.

  - Readahead thrashing is better handled.
    The current readahead leads to tiny average I/O sizes, because it
    never turns back for the thrashed pages. They have to be faulted in
    by do_generic_mapping_read() one by one. Whereas the on-demand
    readahead will redo readahead for them.

OVERHEADS

The new code reduces the overheads of

  - excessively calling the readahead routine on small sized reads
    (the current readahead code insists on seeing all requests)
  - doing a lot of pointless page-cache lookups for small cached files
    (the current readahead only turns itself off after 256 cache hits;
    unfortunately most files are < 1MB, so they never get that chance)

That accounts for speedups of

  - 0.3% on 1-page sequential reads on sparse file
  - 1.2% on 1-page cache hot sequential reads
  - 3.2% on 256-page cache hot sequential reads
  - 1.3% on cache hot `tar /lib`

However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That's a 1% overhead for 1-page random reads on
a sparse file.

PERFORMANCE

The basic benchmark setup is

  - 2.6.20 kernel with on-demand readahead
  - 1MB max readahead size
  - 2.9GHz Intel Core 2 CPU
  - 2GB memory
  - 160G/8M Hitachi SATA II 7200 RPM disk

The benchmarks show that

  - it maintains the same performance for trivial sequential/random reads
  - sysbench/OLTP performance on MySQL gains up to 8%
  - performance on readahead thrashing gains up to 3 times

iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k

                            2.6.20    on-demand     gain
first run
    " Initial write "     61437.27     64521.53    +5.0%
    " Rewrite "           47893.02     48335.20    +0.9%
    " Read "              62111.84     62141.49    +0.0%
    " Re-read "           62242.66     62193.17    -0.1%
    " Reverse Read "      50031.46     49989.79    -0.1%
    " Stride read "        8657.61      8652.81    -0.1%
    " Random read "       13914.28     13898.23    -0.1%
    " Mixed workload "    19069.27     19033.32    -0.2%
    " Random write "      14849.80     14104.38    -5.0%
    " Pwrite "            62955.30     65701.57    +4.4%
    " Pread "             62209.99     62256.26    +0.1%

second run
    " Initial write "     60810.31     66258.69    +9.0%
    " Rewrite "           49373.89     57833.66   +17.1%
    " Read "              62059.39     62251.28    +0.3%
    " Re-read "           62264.32     62256.82    -0.0%
    " Reverse Read "      49970.96     50565.72    +1.2%
    " Stride read "        8654.81      8638.45    -0.2%
    " Random read "       13901.44     13949.91    +0.3%
    " Mixed workload "    19041.32     19092.04    +0.3%
    " Random write "      14019.99     14161.72    +1.0%
    " Pwrite "            64121.67     68224.17    +6.4%
    " Pread "             62225.08     62274.28    +0.1%

In summary, writes are unstable, reads are pretty close on average:

    access pattern          2.6.20    on-demand     gain
    Read                  62085.61     62196.38    +0.2%
    Re-read               62253.49     62224.99    -0.0%
    Reverse Read          50001.21     50277.75    +0.6%
    Stride read            8656.21      8645.63    -0.1%
    Random read           13907.86     13924.07    +0.1%
    Mixed workload        19055.29     19062.68    +0.0%
    Pread                 62217.53     62265.27    +0.1%

aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

                    2.6.20    on-demand    delta
    sequential      92.57s       92.54s    -0.0%
    random         311.87s      312.15s    +0.1%

sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
         --file-total-size=4G --file-block-size=64K \
         --num-threads=001 --max-requests=10000 --max-time=900 run

    threads         2.6.20    on-demand    delta
first run
          1       59.1974s     59.2262s    +0.0%
          2       58.0575s     58.2269s    +0.3%
          4       48.0545s     47.1164s    -2.0%
          8       41.0684s     41.2229s    +0.4%
         16       35.8817s     36.4448s    +1.6%
         32       32.6614s     32.8240s    +0.5%
         64       23.7601s     24.1481s    +1.6%
        128       24.3719s     23.8225s    -2.3%
        256       23.2366s     22.0488s    -5.1%
second run
          1       59.6720s     59.5671s    -0.2%
          8       41.5158s     41.9541s    +1.1%
         64       25.0200s     23.9634s    -4.2%
        256       22.5491s     20.9486s    -7.1%

Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:

    sum all up    495.046s     491.514s    -0.7%

sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
         --mysql-socket=/var/run/mysqld/mysqld.sock \
         --mysql-user=root --mysql-password=readahead \
         --num-threads=064 --max-requests=10000 --max-time=900 run

10000-transactions run
    threads       2.6.20    on-demand     gain
          1        62.81        64.56    +2.8%
          2        67.97        70.93    +4.4%
          4        81.81        85.87    +5.0%
          8        94.60        97.89    +3.5%
         16        99.07       104.68    +5.7%
         32        95.93       104.28    +8.7%
         64        96.48       103.68    +7.5%
5000-transactions run
          1        48.21        48.65    +0.9%
          8        68.60        70.19    +2.3%
         64        70.57        74.72    +5.9%
2000-transactions run
          1        37.57        38.04    +1.3%
          2        38.43        38.99    +1.5%
          4        45.39        46.45    +2.3%
          8        51.64        52.36    +1.4%
         16        54.39        55.18    +1.5%
         32        52.13        54.49    +4.5%
         64        54.13        54.61    +0.9%

These are interesting results. Some investigation shows that

  - MySQL is accessing the db file non-uniformly: some parts are
    hotter than others
  - It is mostly doing 4-page random reads, and sometimes doing two
    reads in a row; the latter one triggers a 16-page readahead.
  - The on-demand readahead leaves many lookahead pages (flagged
    PG_readahead) there. Many of them will be hit, and trigger
    more readahead pages, which might save more seeks.
  - Naturally, the readahead windows tend to lie in hot areas,
    and the lookahead pages in hot areas are more likely to be hit.
  - The higher the overall read density, the greater the possible gain.

That also explains the adaptive readahead tricks for clustered random reads.

readahead thrashing: 3 times better
===================================
We boot the kernel with "mem=128m single", and start one 100KB/s stream every
second, until reaching 200 streams.

                  max throughput    min avg I/O size
    2.6.20:                5MB/s                16KB
    on-demand:            15MB/s               140KB

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

}
EXPORT_SYMBOL_GPL(page_cache_async_ra);

ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
{
	ssize_t ret;
	struct fd f;

	ret = -EBADF;
	f = fdget(fd);
	if (!fd_file(f) || !(fd_file(f)->f_mode & FMODE_READ))
		goto out;

	/*
	 * The readahead() syscall is intended to run only on files
	 * that can execute readahead. If readahead is not possible
	 * on this file, then we must return -EINVAL.
	 */
	ret = -EINVAL;
	if (!fd_file(f)->f_mapping || !fd_file(f)->f_mapping->a_ops ||
	    (!S_ISREG(file_inode(fd_file(f))->i_mode) &&
	    !S_ISBLK(file_inode(fd_file(f))->i_mode)))
		goto out;

	ret = vfs_fadvise(fd_file(f), offset, count, POSIX_FADV_WILLNEED);
out:
	fdput(f);
	return ret;
}

SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
{
	return ksys_readahead(fd, offset, count);
}

#if defined(CONFIG_COMPAT) && defined(__ARCH_WANT_COMPAT_READAHEAD)
COMPAT_SYSCALL_DEFINE4(readahead, int, fd, compat_arg_u64_dual(offset), size_t, count)
{
	return ksys_readahead(fd, compat_arg_u64_glue(offset), count);
}
#endif

/**
 * readahead_expand - Expand a readahead request
 * @ractl: The request to be expanded
 * @new_start: The revised start
 * @new_len: The revised size of the request
 *
 * Attempt to expand a readahead request outwards from the current size to the
 * specified size by inserting locked pages before and after the current window
 * to increase the size to the new window. This may involve the insertion of
 * THPs, in which case the window may get expanded even beyond what was
 * requested.
 *
 * The algorithm will stop if it encounters a conflicting page already in the
 * pagecache and leave a smaller expansion than requested.
 *
 * The caller must check for this by examining the revised @ractl object for a
 * different expansion than was requested.
 */
void readahead_expand(struct readahead_control *ractl,
		      loff_t new_start, size_t new_len)
{
	struct address_space *mapping = ractl->mapping;
	struct file_ra_state *ra = ractl->ra;
	pgoff_t new_index, new_nr_pages;
	gfp_t gfp_mask = readahead_gfp_mask(mapping);
	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
	unsigned int min_order = mapping_min_folio_order(mapping);

	new_index = new_start / PAGE_SIZE;
	/*
	 * Readahead code should have aligned the ractl->_index to
	 * min_nrpages before calling readahead aops.
	 */
	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));

	/* Expand the leading edge downwards */
	while (ractl->_index > new_index) {
		unsigned long index = ractl->_index - 1;
		struct folio *folio = xa_load(&mapping->i_pages, index);

		if (folio && !xa_is_value(folio))
			return; /* Folio apparently present */

		folio = filemap_alloc_folio(gfp_mask, min_order);
		if (!folio)
			return;

		index = mapping_align_index(mapping, index);
		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
			folio_put(folio);
			return;
		}
		if (unlikely(folio_test_workingset(folio)) &&
				!ractl->_workingset) {
			ractl->_workingset = true;
			psi_memstall_enter(&ractl->_pflags);
		}
		ractl->_nr_pages += min_nrpages;
		ractl->_index = folio->index;
	}

	new_len += new_start - readahead_pos(ractl);
	new_nr_pages = DIV_ROUND_UP(new_len, PAGE_SIZE);

	/* Expand the trailing edge upwards */
	while (ractl->_nr_pages < new_nr_pages) {
		unsigned long index = ractl->_index + ractl->_nr_pages;
		struct folio *folio = xa_load(&mapping->i_pages, index);

		if (folio && !xa_is_value(folio))
			return; /* Folio apparently present */

		folio = filemap_alloc_folio(gfp_mask, min_order);
		if (!folio)
			return;

		index = mapping_align_index(mapping, index);
		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
			folio_put(folio);
			return;
		}
		if (unlikely(folio_test_workingset(folio)) &&
				!ractl->_workingset) {
			ractl->_workingset = true;
			psi_memstall_enter(&ractl->_pflags);
		}
		ractl->_nr_pages += min_nrpages;
		if (ra) {
			ra->size += min_nrpages;
			ra->async_size += min_nrpages;
		}
	}
}
EXPORT_SYMBOL(readahead_expand);
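
The window arithmetic in readahead_expand() can be followed with a small user-space model. This is a sketch under simplifying assumptions, not the kernel code: `struct toy_window` and `toy_expand()` are hypothetical names, the page size is fixed at 4096 bytes, folios are a single page (as if min_nrpages were 1), allocation never fails, and the "page cache" is a caller-supplied array of flags.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size */
#define TOY_NR_PAGES  64UL	/* pages tracked by the toy cache (assumption) */

/* Hypothetical stand-in for the window held in struct readahead_control. */
struct toy_window {
	unsigned long index;	/* first page of the window */
	unsigned long nr_pages;	/* window length, in pages */
};

/*
 * Grow @w towards the byte range [new_start, new_start + new_len), one page
 * at a time, stopping early at any page already marked in @present.
 */
static void toy_expand(struct toy_window *w, const bool *present,
		       unsigned long new_start, unsigned long new_len)
{
	unsigned long new_index = new_start / TOY_PAGE_SIZE;
	unsigned long new_nr_pages;

	assert(w->index + w->nr_pages <= TOY_NR_PAGES);

	/* Expand the leading edge downwards. */
	while (w->index > new_index) {
		if (present[w->index - 1])
			return;		/* conflicting page: give up */
		w->index--;
		w->nr_pages++;
	}

	/* Re-base the length on the (possibly lower) window start. */
	new_len += new_start - w->index * TOY_PAGE_SIZE;
	new_nr_pages = (new_len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;

	/* Expand the trailing edge upwards. */
	while (w->nr_pages < new_nr_pages) {
		unsigned long next = w->index + w->nr_pages;

		if (next >= TOY_NR_PAGES || present[next])
			return;
		w->nr_pages++;
	}
}
```

As in the kernel-doc above, the caller of this model has to inspect the resulting window to learn how far the expansion got, because a conflicting page ends it silently.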