2019-05-27 08:55:06 +02:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-or-later
|
2013-07-10 16:05:03 -07:00
|
|
|
/*
|
|
|
|
* zswap.c - zswap driver file
|
|
|
|
*
|
2023-07-17 12:02:27 -04:00
|
|
|
* zswap is a cache that takes pages that are in the process
|
2013-07-10 16:05:03 -07:00
|
|
|
* of being swapped out and attempts to compress and store them in a
|
|
|
|
* RAM-based memory pool. This can result in a significant I/O reduction on
|
|
|
|
* the swap device and, in the case where decompressing from RAM is faster
|
|
|
|
* than reading from the swap device, can also improve workload performance.
|
|
|
|
*
|
|
|
|
* Copyright (C) 2012 Seth Jennings <sjenning@linux.vnet.ibm.com>
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
|
|
|
|
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/spinlock.h>
|
|
|
|
#include <linux/types.h>
|
|
|
|
#include <linux/atomic.h>
|
|
|
|
#include <linux/rbtree.h>
|
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/crypto.h>
|
2020-12-14 19:14:18 -08:00
|
|
|
#include <linux/scatterlist.h>
|
mempolicy: alloc_pages_mpol() for NUMA policy without vma
Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
principal actor for passing mempolicy choice down to __alloc_pages(),
rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
vma_alloc_folio() and alloc_pages() remain, but as wrappers around
alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the
additional args to policy_nodemask(), which subsumes policy_node().
Cleanup throughout, cutting out some unhelpful "helpers".
It would all be much simpler without MPOL_INTERLEAVE, but that adds a
dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8bfd
("tmpfs: distribute interleave better across nodes"), which added ino bias
to the interleave, hidden from mm/mempolicy.c until this commit.
Hence "ilx" throughout, the "interleave index". Originally I thought it
could be done just with nid, but that's wrong: the nodemask may come from
the shared policy layer below a shmem vma, or it may come from the task
layer above a shmem vma; and without the final nodemask then nodeid cannot
be decided. And how ilx is applied depends also on page order.
The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
passed down from vma-less alloc_pages() is also used as hint not to use
THP-style hugepage allocation - to avoid the overhead of a hugepage arg
(though I don't understand why we never just added a GFP bit for THP - if
it actually needs a different allocation strategy from other pages of the
same order). vma_alloc_folio() still carries its hugepage arg here, but
it is not used, and should be removed when agreed.
get_vma_policy() no longer allows a NULL vma: over time I believe we've
eradicated all the places which used to need it e.g. swapoff and madvise
used to pass NULL vma to read_swap_cache_async(), but now know the vma.
[hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()]
Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com
Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com
Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun heo <tj@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Domenico Cerasuolo <mimmocerasuolo@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-19 13:39:08 -07:00
|
|
|
#include <linux/mempolicy.h>
|
2013-07-10 16:05:03 -07:00
|
|
|
#include <linux/mempool.h>
|
2014-08-06 16:08:40 -07:00
|
|
|
#include <linux/zpool.h>
|
2020-12-14 19:14:18 -08:00
|
|
|
#include <crypto/acompress.h>
|
2023-07-17 12:02:27 -04:00
|
|
|
#include <linux/zswap.h>
|
2013-07-10 16:05:03 -07:00
|
|
|
#include <linux/mm_types.h>
|
|
|
|
#include <linux/page-flags.h>
|
|
|
|
#include <linux/swapops.h>
|
|
|
|
#include <linux/writeback.h>
|
|
|
|
#include <linux/pagemap.h>
|
2020-01-30 22:15:04 -08:00
|
|
|
#include <linux/workqueue.h>
|
2023-11-30 11:40:20 -08:00
|
|
|
#include <linux/list_lru.h>
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2022-05-09 18:20:47 -07:00
|
|
|
#include "swap.h"
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
#include "internal.h"
|
2022-05-09 18:20:47 -07:00
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/*********************************
|
|
|
|
* statistics
|
|
|
|
**********************************/
|
2014-08-06 16:08:40 -07:00
|
|
|
/* Total bytes used by the compressed storage */
|
2022-05-19 14:08:53 -07:00
|
|
|
u64 zswap_pool_total_size;
|
2013-07-10 16:05:03 -07:00
|
|
|
/* The number of compressed pages currently stored in zswap */
|
2022-05-19 14:08:53 -07:00
|
|
|
atomic_t zswap_stored_pages = ATOMIC_INIT(0);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
/* The number of same-value filled pages currently stored in zswap */
|
|
|
|
static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The statistics below are not protected from concurrent access for
|
|
|
|
* performance reasons so they may not be a 100% accurate. However,
|
|
|
|
* they do provide useful information on roughly how many times a
|
|
|
|
* certain event is occurring.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* Pool limit was hit (see zswap_max_pool_percent) */
|
|
|
|
static u64 zswap_pool_limit_hit;
|
|
|
|
/* Pages written back when pool limit was reached */
|
|
|
|
static u64 zswap_written_back_pages;
|
|
|
|
/* Store failed due to a reclaim failure after pool limit was reached */
|
|
|
|
static u64 zswap_reject_reclaim_fail;
|
2023-10-24 16:45:09 -07:00
|
|
|
/* Store failed due to compression algorithm failure */
|
|
|
|
static u64 zswap_reject_compress_fail;
|
2013-07-10 16:05:03 -07:00
|
|
|
/* Compressed page was too big for the allocator to (optimally) store */
|
|
|
|
static u64 zswap_reject_compress_poor;
|
|
|
|
/* Store failed because underlying allocator could not get memory */
|
|
|
|
static u64 zswap_reject_alloc_fail;
|
|
|
|
/* Store failed because the entry metadata could not be allocated (rare) */
|
|
|
|
static u64 zswap_reject_kmemcache_fail;
|
|
|
|
/* Duplicate store was encountered (rare) */
|
|
|
|
static u64 zswap_duplicate_entry;
|
|
|
|
|
2020-01-30 22:15:04 -08:00
|
|
|
/* Shrinker work queue */
|
|
|
|
static struct workqueue_struct *shrink_wq;
|
|
|
|
/* Pool limit was hit, we need to calm down */
|
|
|
|
static bool zswap_pool_reached_full;
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/*********************************
|
|
|
|
* tunables
|
|
|
|
**********************************/
|
2015-06-25 15:00:35 -07:00
|
|
|
|
2017-02-27 14:26:50 -08:00
|
|
|
#define ZSWAP_PARAM_UNSET ""
|
|
|
|
|
2023-04-03 20:13:18 +08:00
|
|
|
static int zswap_setup(void);
|
|
|
|
|
2020-04-06 20:08:03 -07:00
|
|
|
/* Enable/disable zswap */
|
|
|
|
static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);
|
zswap: disable changing params if init fails
Add zswap_init_failed bool that prevents changing any of the module
params, if init_zswap() fails, and set zswap_enabled to false. Change
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to 'enabled', 'zpool', or 'compressor' params.
Any driver that is built-in to the kernel will not be unloaded if its
init function returns error, and its module params remain accessible for
users to change via sysfs. Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in WARNING due to the param
callbacks expecting a pool to already exist. This prevents that by
immediately exiting any of the param callbacks if initialization failed.
This was reported here:
https://marc.info/?l=linux-mm&m=147004228125528&w=4
And fixes this WARNING:
[ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60
The warning is just noise, and not serious. However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache. The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as
temporary buffer for compressed pages before copying into place in the
zpool.
If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).
Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 13:13:09 -08:00
|
|
|
static int zswap_enabled_param_set(const char *,
|
|
|
|
const struct kernel_param *);
|
2020-12-14 19:14:11 -08:00
|
|
|
static const struct kernel_param_ops zswap_enabled_param_ops = {
|
zswap: disable changing params if init fails
Add zswap_init_failed bool that prevents changing any of the module
params, if init_zswap() fails, and set zswap_enabled to false. Change
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to 'enabled', 'zpool', or 'compressor' params.
Any driver that is built-in to the kernel will not be unloaded if its
init function returns error, and its module params remain accessible for
users to change via sysfs. Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in WARNING due to the param
callbacks expecting a pool to already exist. This prevents that by
immediately exiting any of the param callbacks if initialization failed.
This was reported here:
https://marc.info/?l=linux-mm&m=147004228125528&w=4
And fixes this WARNING:
[ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60
The warning is just noise, and not serious. However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache. The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as
temporary buffer for compressed pages before copying into place in the
zpool.
If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).
Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 13:13:09 -08:00
|
|
|
.set = zswap_enabled_param_set,
|
|
|
|
.get = param_get_bool,
|
|
|
|
};
|
|
|
|
module_param_cb(enabled, &zswap_enabled_param_ops, &zswap_enabled, 0644);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2015-09-09 15:35:21 -07:00
|
|
|
/* Crypto compressor to use */
|
2020-04-06 20:08:03 -07:00
|
|
|
static char *zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT;
|
2015-09-09 15:35:21 -07:00
|
|
|
static int zswap_compressor_param_set(const char *,
|
|
|
|
const struct kernel_param *);
|
2020-12-14 19:14:11 -08:00
|
|
|
static const struct kernel_param_ops zswap_compressor_param_ops = {
|
2015-09-09 15:35:21 -07:00
|
|
|
.set = zswap_compressor_param_set,
|
2015-11-06 16:29:15 -08:00
|
|
|
.get = param_get_charp,
|
|
|
|
.free = param_free_charp,
|
2015-09-09 15:35:21 -07:00
|
|
|
};
|
|
|
|
module_param_cb(compressor, &zswap_compressor_param_ops,
|
2015-11-06 16:29:15 -08:00
|
|
|
&zswap_compressor, 0644);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2015-09-09 15:35:21 -07:00
|
|
|
/* Compressed storage zpool to use */
|
2020-04-06 20:08:03 -07:00
|
|
|
static char *zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT;
|
2015-09-09 15:35:21 -07:00
|
|
|
static int zswap_zpool_param_set(const char *, const struct kernel_param *);
|
2020-12-14 19:14:11 -08:00
|
|
|
static const struct kernel_param_ops zswap_zpool_param_ops = {
|
2015-11-06 16:29:15 -08:00
|
|
|
.set = zswap_zpool_param_set,
|
|
|
|
.get = param_get_charp,
|
|
|
|
.free = param_free_charp,
|
2015-09-09 15:35:21 -07:00
|
|
|
};
|
2015-11-06 16:29:15 -08:00
|
|
|
module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
|
2014-08-06 16:08:40 -07:00
|
|
|
|
2015-09-09 15:35:21 -07:00
|
|
|
/* The maximum percentage of memory that the compressed pool can occupy */
|
|
|
|
static unsigned int zswap_max_pool_percent = 20;
|
|
|
|
module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
|
2014-04-07 15:38:27 -07:00
|
|
|
|
2020-01-30 22:15:04 -08:00
|
|
|
/* The threshold for accepting new pages after the max_pool_percent was hit */
|
|
|
|
static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
|
|
|
|
module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
|
|
|
|
uint, 0644);
|
|
|
|
|
2022-03-22 14:47:43 -07:00
|
|
|
/*
|
|
|
|
* Enable/disable handling same-value filled pages (enabled by default).
|
|
|
|
* If disabled every page is considered non-same-value filled.
|
|
|
|
*/
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
static bool zswap_same_filled_pages_enabled = true;
|
|
|
|
module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
|
|
|
|
bool, 0644);
|
|
|
|
|
2022-03-22 14:47:43 -07:00
|
|
|
/* Enable/disable handling non-same-value filled pages (enabled by default) */
|
|
|
|
static bool zswap_non_same_filled_pages_enabled = true;
|
|
|
|
module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
|
|
|
|
bool, 0644);
|
|
|
|
|
2023-06-07 19:51:43 +00:00
|
|
|
static bool zswap_exclusive_loads_enabled = IS_ENABLED(
|
|
|
|
CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
|
|
|
|
module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
|
|
|
|
|
2023-06-20 19:46:44 +00:00
|
|
|
/* Number of zpools in zswap_pool (empirically determined for scalability) */
|
|
|
|
#define ZSWAP_NR_ZPOOLS 32
|
|
|
|
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
/* Enable/disable memory pressure-based shrinker. */
|
|
|
|
static bool zswap_shrinker_enabled = IS_ENABLED(
|
|
|
|
CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
|
|
|
|
module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
|
|
|
|
|
zswap: memcontrol: implement zswap writeback disabling
During our experiment with zswap, we sometimes observe swap IOs due to
occasional zswap store failures and writebacks-to-swap. These swapping
IOs prevent many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not
writeback to backing swapping device when a zswap store attempt fail, and
do not write pages in the zswap pool back to the backing swap device (both
when the pool is full, and when the new zswap shrinker is called).
This new behavior can be opted-in/out on a per-cgroup basis via a new
cgroup file. By default, writebacks to swap device is enabled, which is
the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of its
parent.
Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
as it also disables the zswap shrinker, a major source of zswap
writebacks.
For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who are
using servers with really slow disk performance (for instance, massive
but slow HDDs). For these folks, it's impossible to convince them to
even entertain zswap if swapping also comes as a packaged deal.
Writeback disabling is quite a useful feature in these situations - on
a mixed workloads deployment, they can disable writeback for the more
IO-sensitive workloads, and enable writeback for other background
workloads.
For instance, on a server with HDD, I allocate memories and populate
them with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memories and just read through it a couple of times
(doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out
zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Heidelberg <david@ixit.cz>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07 11:24:06 -08:00
|
|
|
bool is_zswap_enabled(void)
|
|
|
|
{
|
|
|
|
return zswap_enabled;
|
|
|
|
}
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/*********************************
|
2015-09-09 15:35:19 -07:00
|
|
|
* data structures
|
2013-07-10 16:05:03 -07:00
|
|
|
**********************************/
|
|
|
|
|
2020-12-14 19:14:18 -08:00
|
|
|
struct crypto_acomp_ctx {
|
|
|
|
struct crypto_acomp *acomp;
|
|
|
|
struct acomp_req *req;
|
|
|
|
struct crypto_wait wait;
|
2023-12-28 09:45:46 +00:00
|
|
|
u8 *buffer;
|
|
|
|
struct mutex mutex;
|
2020-12-14 19:14:18 -08:00
|
|
|
};
|
|
|
|
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
/*
|
|
|
|
* The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock.
|
|
|
|
* The only case where lru_lock is not acquired while holding tree.lock is
|
|
|
|
* when a zswap_entry is taken off the lru for writeback, in that case it
|
|
|
|
* needs to be verified that it's still valid in the tree.
|
|
|
|
*/
|
2015-09-09 15:35:19 -07:00
|
|
|
struct zswap_pool {
|
2023-06-20 19:46:44 +00:00
|
|
|
struct zpool *zpools[ZSWAP_NR_ZPOOLS];
|
2020-12-14 19:14:18 -08:00
|
|
|
struct crypto_acomp_ctx __percpu *acomp_ctx;
|
2015-09-09 15:35:19 -07:00
|
|
|
struct kref kref;
|
|
|
|
struct list_head list;
|
2020-01-30 22:15:04 -08:00
|
|
|
struct work_struct release_work;
|
|
|
|
struct work_struct shrink_work;
|
2016-11-27 00:13:40 +01:00
|
|
|
struct hlist_node node;
|
2015-09-09 15:35:19 -07:00
|
|
|
char tfm_name[CRYPTO_MAX_ALG_NAME];
|
2023-11-30 11:40:20 -08:00
|
|
|
struct list_lru list_lru;
|
|
|
|
struct mem_cgroup *next_shrink;
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
struct shrinker *shrinker;
|
|
|
|
atomic_t nr_stored;
|
2013-07-10 16:05:03 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* struct zswap_entry
|
|
|
|
*
|
|
|
|
* This structure contains the metadata for tracking a single compressed
|
|
|
|
* page within zswap.
|
|
|
|
*
|
|
|
|
* rbnode - links the entry into red-black tree for the appropriate swap type
|
2023-08-08 06:20:56 +00:00
|
|
|
* swpentry - associated swap entry, the offset indexes into the red-black tree
|
2013-07-10 16:05:03 -07:00
|
|
|
* refcount - the number of outstanding reference to the entry. This is needed
|
|
|
|
* to protect against premature freeing of the entry by code
|
2014-04-07 15:38:25 -07:00
|
|
|
* concurrent calls to load, invalidate, and writeback. The lock
|
2013-07-10 16:05:03 -07:00
|
|
|
* for the zswap_tree structure that contains the entry must
|
|
|
|
* be held while changing the refcount. Since the lock must
|
|
|
|
* be held, there is no reason to also make refcount atomic.
|
|
|
|
* length - the length in bytes of the compressed page data. Needed during
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
* decompression. For a same value filled page length is 0, and both
|
|
|
|
* pool and lru are invalid and must be ignored.
|
2015-09-09 15:35:19 -07:00
|
|
|
* pool - the zswap_pool the entry's data is in
|
|
|
|
* handle - zpool allocation handle that stores the compressed page data
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
* value - value of the same-value filled pages which have same content
|
2023-08-08 06:20:56 +00:00
|
|
|
* objcg - the obj_cgroup that the compressed memory is charged to
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
* lru - handle to the pool's lru used to evict pages.
|
2013-07-10 16:05:03 -07:00
|
|
|
*/
|
|
|
|
struct zswap_entry {
|
|
|
|
struct rb_node rbnode;
|
2023-06-12 11:38:15 +02:00
|
|
|
swp_entry_t swpentry;
|
2013-07-10 16:05:03 -07:00
|
|
|
int refcount;
|
|
|
|
unsigned int length;
|
2015-09-09 15:35:19 -07:00
|
|
|
struct zswap_pool *pool;
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
union {
|
|
|
|
unsigned long handle;
|
|
|
|
unsigned long value;
|
|
|
|
};
|
2022-05-19 14:08:53 -07:00
|
|
|
struct obj_cgroup *objcg;
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
struct list_head lru;
|
2013-07-10 16:05:03 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The tree lock in the zswap_tree struct protects a few things:
|
|
|
|
* - the rbtree
|
|
|
|
* - the refcount field of each entry in the tree
|
|
|
|
*/
|
|
|
|
struct zswap_tree {
|
|
|
|
struct rb_root rbroot;
|
|
|
|
spinlock_t lock;
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
|
2024-01-19 11:22:23 +00:00
|
|
|
static unsigned int nr_zswap_trees[MAX_SWAPFILES];
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2015-09-09 15:35:19 -07:00
|
|
|
/* RCU-protected iteration */
|
|
|
|
static LIST_HEAD(zswap_pools);
|
|
|
|
/* protects zswap_pools list modification */
|
|
|
|
static DEFINE_SPINLOCK(zswap_pools_lock);
|
2016-05-05 16:22:23 -07:00
|
|
|
/* pool counter to provide unique names to zpool */
|
|
|
|
static atomic_t zswap_pools_count = ATOMIC_INIT(0);
|
2015-09-09 15:35:19 -07:00
|
|
|
|
2023-04-03 20:13:17 +08:00
|
|
|
enum zswap_init_type {
|
|
|
|
ZSWAP_UNINIT,
|
|
|
|
ZSWAP_INIT_SUCCEED,
|
|
|
|
ZSWAP_INIT_FAILED
|
|
|
|
};
|
2015-09-09 15:35:21 -07:00
|
|
|
|
2023-04-03 20:13:17 +08:00
|
|
|
static enum zswap_init_type zswap_init_state;
|
2015-09-09 15:35:21 -07:00
|
|
|
|
2023-04-03 20:13:18 +08:00
|
|
|
/* used to ensure the integrity of initialization */
|
|
|
|
static DEFINE_MUTEX(zswap_init_lock);
|
zswap: disable changing params if init fails
Add zswap_init_failed bool that prevents changing any of the module
params, if init_zswap() fails, and set zswap_enabled to false. Change
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to 'enabled', 'zpool', or 'compressor' params.
Any driver that is built-in to the kernel will not be unloaded if its
init function returns error, and its module params remain accessible for
users to change via sysfs. Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in WARNING due to the param
callbacks expecting a pool to already exist. This prevents that by
immediately exiting any of the param callbacks if initialization failed.
This was reported here:
https://marc.info/?l=linux-mm&m=147004228125528&w=4
And fixes this WARNING:
[ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60
The warning is just noise, and not serious. However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache. The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as
temporary buffer for compressed pages before copying into place in the
zpool.
If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).
Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 13:13:09 -08:00
|
|
|
|
2017-02-27 14:26:47 -08:00
|
|
|
/* init completed, but couldn't create the initial pool */
|
|
|
|
static bool zswap_has_pool;
|
|
|
|
|
2015-09-09 15:35:19 -07:00
|
|
|
/*********************************
|
|
|
|
* helpers and fwd declarations
|
|
|
|
**********************************/
|
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
|
|
|
|
{
|
|
|
|
return &zswap_trees[swp_type(swp)][swp_offset(swp)
|
|
|
|
>> SWAP_ADDRESS_SPACE_SHIFT];
|
|
|
|
}
|
|
|
|
|
2015-09-09 15:35:19 -07:00
|
|
|
#define zswap_pool_debug(msg, p) \
|
|
|
|
pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \
|
2023-06-20 19:46:44 +00:00
|
|
|
zpool_get_type((p)->zpools[0]))
|
2015-09-09 15:35:19 -07:00
|
|
|
|
2023-06-12 11:38:15 +02:00
|
|
|
static int zswap_writeback_entry(struct zswap_entry *entry,
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
swp_entry_t swpentry);
|
2015-09-09 15:35:19 -07:00
|
|
|
|
|
|
|
static bool zswap_is_full(void)
|
|
|
|
{
|
2018-12-28 00:34:29 -08:00
|
|
|
return totalram_pages() * zswap_max_pool_percent / 100 <
|
|
|
|
DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
|
2015-09-09 15:35:19 -07:00
|
|
|
}
|
|
|
|
|
2020-01-30 22:15:04 -08:00
|
|
|
static bool zswap_can_accept(void)
|
|
|
|
{
|
|
|
|
return totalram_pages() * zswap_accept_thr_percent / 100 *
|
|
|
|
zswap_max_pool_percent / 100 >
|
|
|
|
DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
|
|
|
|
}
|
|
|
|
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
static u64 get_zswap_pool_size(struct zswap_pool *pool)
|
|
|
|
{
|
|
|
|
u64 pool_size = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
|
|
|
|
pool_size += zpool_get_total_size(pool->zpools[i]);
|
|
|
|
|
|
|
|
return pool_size;
|
|
|
|
}
|
|
|
|
|
2015-09-09 15:35:19 -07:00
|
|
|
static void zswap_update_total_size(void)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
u64 total = 0;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(pool, &zswap_pools, list)
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
total += get_zswap_pool_size(pool);
|
2015-09-09 15:35:19 -07:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
zswap_pool_total_size = total;
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:46 -05:00
|
|
|
/*********************************
|
|
|
|
* pool functions
|
|
|
|
**********************************/
|
|
|
|
|
|
|
|
static void zswap_alloc_shrinker(struct zswap_pool *pool);
|
|
|
|
static void shrink_worker(struct work_struct *w);
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
char name[38]; /* 'zswap' + 32 char (max) num + \0 */
|
|
|
|
gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!zswap_has_pool) {
|
|
|
|
/* if either are unset, pool initialization failed, and we
|
|
|
|
* need both params to be set correctly before trying to
|
|
|
|
* create a pool.
|
|
|
|
*/
|
|
|
|
if (!strcmp(type, ZSWAP_PARAM_UNSET))
|
|
|
|
return NULL;
|
|
|
|
if (!strcmp(compressor, ZSWAP_PARAM_UNSET))
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
pool = kzalloc(sizeof(*pool), GFP_KERNEL);
|
|
|
|
if (!pool)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) {
|
|
|
|
/* unique name for each pool specifically required by zsmalloc */
|
|
|
|
snprintf(name, 38, "zswap%x",
|
|
|
|
atomic_inc_return(&zswap_pools_count));
|
|
|
|
|
|
|
|
pool->zpools[i] = zpool_create_pool(type, name, gfp);
|
|
|
|
if (!pool->zpools[i]) {
|
|
|
|
pr_err("%s zpool not available\n", type);
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
pr_debug("using %s zpool\n", zpool_get_type(pool->zpools[0]));
|
|
|
|
|
|
|
|
strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
|
|
|
|
|
|
|
|
pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
|
|
|
|
if (!pool->acomp_ctx) {
|
|
|
|
pr_err("percpu alloc failed\n");
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
|
|
|
|
&pool->node);
|
|
|
|
if (ret)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
zswap_alloc_shrinker(pool);
|
|
|
|
if (!pool->shrinker)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
pr_debug("using %s compressor\n", pool->tfm_name);
|
|
|
|
|
|
|
|
/* being the current pool takes 1 ref; this func expects the
|
|
|
|
* caller to always add the new pool as the current pool
|
|
|
|
*/
|
|
|
|
kref_init(&pool->kref);
|
|
|
|
INIT_LIST_HEAD(&pool->list);
|
|
|
|
if (list_lru_init_memcg(&pool->list_lru, pool->shrinker))
|
|
|
|
goto lru_fail;
|
|
|
|
shrinker_register(pool->shrinker);
|
|
|
|
INIT_WORK(&pool->shrink_work, shrink_worker);
|
|
|
|
atomic_set(&pool->nr_stored, 0);
|
|
|
|
|
|
|
|
zswap_pool_debug("created", pool);
|
|
|
|
|
|
|
|
return pool;
|
|
|
|
|
|
|
|
lru_fail:
|
|
|
|
list_lru_destroy(&pool->list_lru);
|
|
|
|
shrinker_free(pool->shrinker);
|
|
|
|
error:
|
|
|
|
if (pool->acomp_ctx)
|
|
|
|
free_percpu(pool->acomp_ctx);
|
|
|
|
while (i--)
|
|
|
|
zpool_destroy_pool(pool->zpools[i]);
|
|
|
|
kfree(pool);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *__zswap_pool_create_fallback(void)
|
|
|
|
{
|
|
|
|
bool has_comp, has_zpool;
|
|
|
|
|
|
|
|
has_comp = crypto_has_acomp(zswap_compressor, 0, 0);
|
|
|
|
if (!has_comp && strcmp(zswap_compressor,
|
|
|
|
CONFIG_ZSWAP_COMPRESSOR_DEFAULT)) {
|
|
|
|
pr_err("compressor %s not available, using default %s\n",
|
|
|
|
zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT);
|
|
|
|
param_free_charp(&zswap_compressor);
|
|
|
|
zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT;
|
|
|
|
has_comp = crypto_has_acomp(zswap_compressor, 0, 0);
|
|
|
|
}
|
|
|
|
if (!has_comp) {
|
|
|
|
pr_err("default compressor %s not available\n",
|
|
|
|
zswap_compressor);
|
|
|
|
param_free_charp(&zswap_compressor);
|
|
|
|
zswap_compressor = ZSWAP_PARAM_UNSET;
|
|
|
|
}
|
|
|
|
|
|
|
|
has_zpool = zpool_has_pool(zswap_zpool_type);
|
|
|
|
if (!has_zpool && strcmp(zswap_zpool_type,
|
|
|
|
CONFIG_ZSWAP_ZPOOL_DEFAULT)) {
|
|
|
|
pr_err("zpool %s not available, using default %s\n",
|
|
|
|
zswap_zpool_type, CONFIG_ZSWAP_ZPOOL_DEFAULT);
|
|
|
|
param_free_charp(&zswap_zpool_type);
|
|
|
|
zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT;
|
|
|
|
has_zpool = zpool_has_pool(zswap_zpool_type);
|
|
|
|
}
|
|
|
|
if (!has_zpool) {
|
|
|
|
pr_err("default zpool %s not available\n",
|
|
|
|
zswap_zpool_type);
|
|
|
|
param_free_charp(&zswap_zpool_type);
|
|
|
|
zswap_zpool_type = ZSWAP_PARAM_UNSET;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!has_comp || !has_zpool)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return zswap_pool_create(zswap_zpool_type, zswap_compressor);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_pool_destroy(struct zswap_pool *pool)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
zswap_pool_debug("destroying", pool);
|
|
|
|
|
|
|
|
shrinker_free(pool->shrinker);
|
|
|
|
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
|
|
|
|
free_percpu(pool->acomp_ctx);
|
|
|
|
list_lru_destroy(&pool->list_lru);
|
|
|
|
|
|
|
|
spin_lock(&zswap_pools_lock);
|
|
|
|
mem_cgroup_iter_break(NULL, pool->next_shrink);
|
|
|
|
pool->next_shrink = NULL;
|
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
|
|
|
|
for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
|
|
|
|
zpool_destroy_pool(pool->zpools[i]);
|
|
|
|
kfree(pool);
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:47 -05:00
|
|
|
static void __zswap_pool_release(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool = container_of(work, typeof(*pool),
|
|
|
|
release_work);
|
|
|
|
|
|
|
|
synchronize_rcu();
|
|
|
|
|
|
|
|
/* nobody should have been able to get a kref... */
|
|
|
|
WARN_ON(kref_get_unless_zero(&pool->kref));
|
|
|
|
|
|
|
|
/* pool is now off zswap_pools list and has no references. */
|
|
|
|
zswap_pool_destroy(pool);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_current(void);
|
|
|
|
|
|
|
|
static void __zswap_pool_empty(struct kref *kref)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
pool = container_of(kref, typeof(*pool), kref);
|
|
|
|
|
|
|
|
spin_lock(&zswap_pools_lock);
|
|
|
|
|
|
|
|
WARN_ON(pool == zswap_pool_current());
|
|
|
|
|
|
|
|
list_del_rcu(&pool->list);
|
|
|
|
|
|
|
|
INIT_WORK(&pool->release_work, __zswap_pool_release);
|
|
|
|
schedule_work(&pool->release_work);
|
|
|
|
|
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __must_check zswap_pool_get(struct zswap_pool *pool)
|
|
|
|
{
|
|
|
|
if (!pool)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
return kref_get_unless_zero(&pool->kref);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_pool_put(struct zswap_pool *pool)
|
|
|
|
{
|
|
|
|
kref_put(&pool->kref, __zswap_pool_empty);
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:48 -05:00
|
|
|
static struct zswap_pool *__zswap_pool_current(void)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
pool = list_first_or_null_rcu(&zswap_pools, typeof(*pool), list);
|
|
|
|
WARN_ONCE(!pool && zswap_has_pool,
|
|
|
|
"%s: no page storage pool!\n", __func__);
|
|
|
|
|
|
|
|
return pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_current(void)
|
|
|
|
{
|
|
|
|
assert_spin_locked(&zswap_pools_lock);
|
|
|
|
|
|
|
|
return __zswap_pool_current();
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_current_get(void)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
pool = __zswap_pool_current();
|
|
|
|
if (!zswap_pool_get(pool))
|
|
|
|
pool = NULL;
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_last_get(void)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool, *last = NULL;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(pool, &zswap_pools, list)
|
|
|
|
last = pool;
|
|
|
|
WARN_ONCE(!last && zswap_has_pool,
|
|
|
|
"%s: no page storage pool!\n", __func__);
|
|
|
|
if (!zswap_pool_get(last))
|
|
|
|
last = NULL;
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return last;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* type and compressor must be null-terminated */
|
|
|
|
static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
assert_spin_locked(&zswap_pools_lock);
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(pool, &zswap_pools, list) {
|
|
|
|
if (strcmp(pool->tfm_name, compressor))
|
|
|
|
continue;
|
|
|
|
/* all zpools share the same type */
|
|
|
|
if (strcmp(zpool_get_type(pool->zpools[0]), type))
|
|
|
|
continue;
|
|
|
|
/* if we can't get it, it's about to be destroyed */
|
|
|
|
if (!zswap_pool_get(pool))
|
|
|
|
continue;
|
|
|
|
return pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:49 -05:00
|
|
|
/*********************************
|
|
|
|
* param callbacks
|
|
|
|
**********************************/
|
|
|
|
|
|
|
|
static bool zswap_pool_changed(const char *s, const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
/* no change required */
|
|
|
|
if (!strcmp(s, *(char **)kp->arg) && zswap_has_pool)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* val must be a null-terminated string */
|
|
|
|
static int __zswap_param_set(const char *val, const struct kernel_param *kp,
|
|
|
|
char *type, char *compressor)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool, *put_pool = NULL;
|
|
|
|
char *s = strstrip((char *)val);
|
|
|
|
int ret = 0;
|
|
|
|
bool new_pool = false;
|
|
|
|
|
|
|
|
mutex_lock(&zswap_init_lock);
|
|
|
|
switch (zswap_init_state) {
|
|
|
|
case ZSWAP_UNINIT:
|
|
|
|
/* if this is load-time (pre-init) param setting,
|
|
|
|
* don't create a pool; that's done during init.
|
|
|
|
*/
|
|
|
|
ret = param_set_charp(s, kp);
|
|
|
|
break;
|
|
|
|
case ZSWAP_INIT_SUCCEED:
|
|
|
|
new_pool = zswap_pool_changed(s, kp);
|
|
|
|
break;
|
|
|
|
case ZSWAP_INIT_FAILED:
|
|
|
|
pr_err("can't set param, initialization failed\n");
|
|
|
|
ret = -ENODEV;
|
|
|
|
}
|
|
|
|
mutex_unlock(&zswap_init_lock);
|
|
|
|
|
|
|
|
/* no need to create a new pool, return directly */
|
|
|
|
if (!new_pool)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (!type) {
|
|
|
|
if (!zpool_has_pool(s)) {
|
|
|
|
pr_err("zpool %s not available\n", s);
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
type = s;
|
|
|
|
} else if (!compressor) {
|
|
|
|
if (!crypto_has_acomp(s, 0, 0)) {
|
|
|
|
pr_err("compressor %s not available\n", s);
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
compressor = s;
|
|
|
|
} else {
|
|
|
|
WARN_ON(1);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&zswap_pools_lock);
|
|
|
|
|
|
|
|
pool = zswap_pool_find_get(type, compressor);
|
|
|
|
if (pool) {
|
|
|
|
zswap_pool_debug("using existing", pool);
|
|
|
|
WARN_ON(pool == zswap_pool_current());
|
|
|
|
list_del_rcu(&pool->list);
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
|
|
|
|
if (!pool)
|
|
|
|
pool = zswap_pool_create(type, compressor);
|
|
|
|
|
|
|
|
if (pool)
|
|
|
|
ret = param_set_charp(s, kp);
|
|
|
|
else
|
|
|
|
ret = -EINVAL;
|
|
|
|
|
|
|
|
spin_lock(&zswap_pools_lock);
|
|
|
|
|
|
|
|
if (!ret) {
|
|
|
|
put_pool = zswap_pool_current();
|
|
|
|
list_add_rcu(&pool->list, &zswap_pools);
|
|
|
|
zswap_has_pool = true;
|
|
|
|
} else if (pool) {
|
|
|
|
/* add the possibly pre-existing pool to the end of the pools
|
|
|
|
* list; if it's new (and empty) then it'll be removed and
|
|
|
|
* destroyed by the put after we drop the lock
|
|
|
|
*/
|
|
|
|
list_add_tail_rcu(&pool->list, &zswap_pools);
|
|
|
|
put_pool = pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
|
|
|
|
if (!zswap_has_pool && !pool) {
|
|
|
|
/* if initial pool creation failed, and this pool creation also
|
|
|
|
* failed, maybe both compressor and zpool params were bad.
|
|
|
|
* Allow changing this param, so pool creation will succeed
|
|
|
|
* when the other param is changed. We already verified this
|
|
|
|
* param is ok in the zpool_has_pool() or crypto_has_acomp()
|
|
|
|
* checks above.
|
|
|
|
*/
|
|
|
|
ret = param_set_charp(s, kp);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* drop the ref from either the old current pool,
|
|
|
|
* or the new pool we failed to add
|
|
|
|
*/
|
|
|
|
if (put_pool)
|
|
|
|
zswap_pool_put(put_pool);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int zswap_compressor_param_set(const char *val,
|
|
|
|
const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
return __zswap_param_set(val, kp, zswap_zpool_type, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int zswap_zpool_param_set(const char *val,
|
|
|
|
const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
return __zswap_param_set(val, kp, NULL, zswap_compressor);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int zswap_enabled_param_set(const char *val,
|
|
|
|
const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
int ret = -ENODEV;
|
|
|
|
|
|
|
|
/* if this is load-time (pre-init) param setting, only set param. */
|
|
|
|
if (system_state != SYSTEM_RUNNING)
|
|
|
|
return param_set_bool(val, kp);
|
|
|
|
|
|
|
|
mutex_lock(&zswap_init_lock);
|
|
|
|
switch (zswap_init_state) {
|
|
|
|
case ZSWAP_UNINIT:
|
|
|
|
if (zswap_setup())
|
|
|
|
break;
|
|
|
|
fallthrough;
|
|
|
|
case ZSWAP_INIT_SUCCEED:
|
|
|
|
if (!zswap_has_pool)
|
|
|
|
pr_err("can't enable, no pool configured\n");
|
|
|
|
else
|
|
|
|
ret = param_set_bool(val, kp);
|
|
|
|
break;
|
|
|
|
case ZSWAP_INIT_FAILED:
|
|
|
|
pr_err("can't enable, initialization failed\n");
|
|
|
|
}
|
|
|
|
mutex_unlock(&zswap_init_lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
/* should be called under RCU */
|
|
|
|
#ifdef CONFIG_MEMCG
|
|
|
|
static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
return entry->objcg ? obj_cgroup_memcg(entry->objcg) : NULL;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static inline int entry_to_nid(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
return page_to_nid(virt_to_page(entry));
|
|
|
|
}
|
|
|
|
|
|
|
|
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
/* lock out zswap pools list modification */
|
|
|
|
spin_lock(&zswap_pools_lock);
|
|
|
|
list_for_each_entry(pool, &zswap_pools, list) {
|
|
|
|
if (pool->next_shrink == memcg)
|
|
|
|
pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
|
|
|
|
}
|
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
}
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/*********************************
|
|
|
|
* zswap entry functions
|
|
|
|
**********************************/
|
|
|
|
static struct kmem_cache *zswap_entry_cache;
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
|
|
|
struct zswap_entry *entry;
|
2023-11-30 11:40:20 -08:00
|
|
|
entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
|
2013-07-10 16:05:03 -07:00
|
|
|
if (!entry)
|
|
|
|
return NULL;
|
|
|
|
entry->refcount = 1;
|
2013-11-12 15:08:27 -08:00
|
|
|
RB_CLEAR_NODE(&entry->rbnode);
|
2013-07-10 16:05:03 -07:00
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_entry_cache_free(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
kmem_cache_free(zswap_entry_cache, entry);
|
|
|
|
}
|
|
|
|
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
/*********************************
|
|
|
|
* zswap lruvec functions
|
|
|
|
**********************************/
|
|
|
|
void zswap_lruvec_state_init(struct lruvec *lruvec)
|
|
|
|
{
|
|
|
|
atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
|
|
|
|
}
|
|
|
|
|
2023-12-13 21:58:30 +00:00
|
|
|
void zswap_folio_swapin(struct folio *folio)
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
{
|
|
|
|
struct lruvec *lruvec;
|
|
|
|
|
mm/swap_state: update zswap LRU's protection range with the folio locked
When a folio is swapped in, the protection size of the corresponding zswap
LRU is incremented, so that the zswap shrinker is more conservative with
its reclaiming action. This field is embedded within the struct lruvec,
so updating it requires looking up the folio's memcg and lruvec. However,
currently this lookup can happen after the folio is unlocked, for instance
if a new folio is allocated, and swap_read_folio() unlocks the folio
before returning. In this scenario, there is no stability guarantee for
the binding between a folio and its memcg and lruvec:
* A folio's memcg and lruvec can be freed between the lookup and the
update, leading to a UAF.
* Folio migration can clear the now-unlocked folio's memcg_data, which
directs the zswap LRU protection size update towards the root memcg
instead of the original memcg. This was recently picked up by the
syzbot thanks to a warning in the inlined folio_lruvec() call.
Move the zswap LRU protection range update above the swap_read_folio()
call, and only when a new page is allocated, to prevent this.
[nphamcs@gmail.com: add VM_WARN_ON_ONCE() to zswap_folio_swapin()]
Link: https://lkml.kernel.org/r/20240206180855.3987204-1-nphamcs@gmail.com
[nphamcs@gmail.com: remove unneeded if (folio) checks]
Link: https://lkml.kernel.org/r/20240206191355.83755-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20240205232442.3240571-1-nphamcs@gmail.com
Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Reported-by: syzbot+17a611d10af7d18a7092@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000ae47f90610803260@google.com/
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-05 15:24:42 -08:00
|
|
|
VM_WARN_ON_ONCE(!folio_test_locked(folio));
|
|
|
|
lruvec = folio_lruvec(folio);
|
|
|
|
atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
}
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
/*********************************
|
|
|
|
* lru functions
|
|
|
|
**********************************/
|
|
|
|
static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
|
|
|
|
{
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
atomic_long_t *nr_zswap_protected;
|
|
|
|
unsigned long lru_size, old, new;
|
2023-11-30 11:40:20 -08:00
|
|
|
int nid = entry_to_nid(entry);
|
|
|
|
struct mem_cgroup *memcg;
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
struct lruvec *lruvec;
|
2023-11-30 11:40:20 -08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Note that it is safe to use rcu_read_lock() here, even in the face of
|
|
|
|
* concurrent memcg offlining. Thanks to the memcg->kmemcg_id indirection
|
|
|
|
* used in list_lru lookup, only two scenarios are possible:
|
|
|
|
*
|
|
|
|
* 1. list_lru_add() is called before memcg->kmemcg_id is updated. The
|
|
|
|
* new entry will be reparented to memcg's parent's list_lru.
|
|
|
|
* 2. list_lru_add() is called after memcg->kmemcg_id is updated. The
|
|
|
|
* new entry will be added directly to memcg's parent's list_lru.
|
|
|
|
*
|
2024-01-28 13:28:51 +00:00
|
|
|
* Similar reasoning holds for list_lru_del().
|
2023-11-30 11:40:20 -08:00
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
memcg = mem_cgroup_from_entry(entry);
|
|
|
|
/* will always succeed */
|
|
|
|
list_lru_add(list_lru, &entry->lru, nid, memcg);
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
|
|
|
|
/* Update the protection area */
|
|
|
|
lru_size = list_lru_count_one(list_lru, nid, memcg);
|
|
|
|
lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
|
|
|
|
nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
|
|
|
|
old = atomic_long_inc_return(nr_zswap_protected);
|
|
|
|
/*
|
|
|
|
* Decay to avoid overflow and adapt to changing workloads.
|
|
|
|
* This is based on LRU reclaim cost decaying heuristics.
|
|
|
|
*/
|
|
|
|
do {
|
|
|
|
new = old > lru_size / 4 ? old / 2 : old;
|
|
|
|
} while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
|
2023-11-30 11:40:20 -08:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
int nid = entry_to_nid(entry);
|
|
|
|
struct mem_cgroup *memcg;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
memcg = mem_cgroup_from_entry(entry);
|
|
|
|
/* will always succeed */
|
|
|
|
list_lru_del(list_lru, &entry->lru, nid, memcg);
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/*********************************
|
|
|
|
* rbtree functions
|
|
|
|
**********************************/
|
|
|
|
static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
|
|
|
|
{
|
|
|
|
struct rb_node *node = root->rb_node;
|
|
|
|
struct zswap_entry *entry;
|
2023-06-12 11:38:15 +02:00
|
|
|
pgoff_t entry_offset;
|
2013-07-10 16:05:03 -07:00
|
|
|
|
|
|
|
while (node) {
|
|
|
|
entry = rb_entry(node, struct zswap_entry, rbnode);
|
2023-06-12 11:38:15 +02:00
|
|
|
entry_offset = swp_offset(entry->swpentry);
|
|
|
|
if (entry_offset > offset)
|
2013-07-10 16:05:03 -07:00
|
|
|
node = node->rb_left;
|
2023-06-12 11:38:15 +02:00
|
|
|
else if (entry_offset < offset)
|
2013-07-10 16:05:03 -07:00
|
|
|
node = node->rb_right;
|
|
|
|
else
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In the case that a entry with the same offset is found, a pointer to
|
|
|
|
* the existing entry is stored in dupentry and the function returns -EEXIST
|
|
|
|
*/
|
|
|
|
static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
|
|
|
|
struct zswap_entry **dupentry)
|
|
|
|
{
|
|
|
|
struct rb_node **link = &root->rb_node, *parent = NULL;
|
|
|
|
struct zswap_entry *myentry;
|
2023-06-12 11:38:15 +02:00
|
|
|
pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
|
|
|
while (*link) {
|
|
|
|
parent = *link;
|
|
|
|
myentry = rb_entry(parent, struct zswap_entry, rbnode);
|
2023-06-12 11:38:15 +02:00
|
|
|
myentry_offset = swp_offset(myentry->swpentry);
|
|
|
|
if (myentry_offset > entry_offset)
|
2013-07-10 16:05:03 -07:00
|
|
|
link = &(*link)->rb_left;
|
2023-06-12 11:38:15 +02:00
|
|
|
else if (myentry_offset < entry_offset)
|
2013-07-10 16:05:03 -07:00
|
|
|
link = &(*link)->rb_right;
|
|
|
|
else {
|
|
|
|
*dupentry = myentry;
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rb_link_node(&entry->rbnode, parent, link);
|
|
|
|
rb_insert_color(&entry->rbnode, root);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
mm: zswap: fix double invalidate with exclusive loads
If exclusive loads are enabled for zswap, we invalidate the entry before
returning from zswap_frontswap_load(), after dropping the local reference.
However, the tree lock is dropped during decompression after the local
reference is acquired, so the entry could be invalidated before we drop
the local ref. If this happens, the entry is freed once we drop the local
ref, and zswap_invalidate_entry() tries to invalidate an already freed
entry.
Fix this by:
(a) Making sure zswap_invalidate_entry() is always called with a local
ref held, to avoid being called on a freed entry.
(b) Making sure zswap_invalidate_entry() only drops the ref if the entry
was actually on the rbtree. Otherwise, another invalidation could
have already happened, and the initial ref is already dropped.
With these changes, there is no need to check that there is no need to
make sure the entry still exists in the tree in zswap_reclaim_entry()
before invalidating it, as zswap_reclaim_entry() will make this check
internally.
Link: https://lkml.kernel.org/r/20230621093009.637544-1-yosryahmed@google.com
Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-21 09:30:09 +00:00
|
|
|
static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
|
2013-11-12 15:08:27 -08:00
|
|
|
{
|
|
|
|
if (!RB_EMPTY_NODE(&entry->rbnode)) {
|
|
|
|
rb_erase(&entry->rbnode, root);
|
|
|
|
RB_CLEAR_NODE(&entry->rbnode);
|
mm: zswap: fix double invalidate with exclusive loads
If exclusive loads are enabled for zswap, we invalidate the entry before
returning from zswap_frontswap_load(), after dropping the local reference.
However, the tree lock is dropped during decompression after the local
reference is acquired, so the entry could be invalidated before we drop
the local ref. If this happens, the entry is freed once we drop the local
ref, and zswap_invalidate_entry() tries to invalidate an already freed
entry.
Fix this by:
(a) Making sure zswap_invalidate_entry() is always called with a local
ref held, to avoid being called on a freed entry.
(b) Making sure zswap_invalidate_entry() only drops the ref if the entry
was actually on the rbtree. Otherwise, another invalidation could
have already happened, and the initial ref is already dropped.
With these changes, there is no need to check that there is no need to
make sure the entry still exists in the tree in zswap_reclaim_entry()
before invalidating it, as zswap_reclaim_entry() will make this check
internally.
Link: https://lkml.kernel.org/r/20230621093009.637544-1-yosryahmed@google.com
Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-21 09:30:09 +00:00
|
|
|
return true;
|
2013-11-12 15:08:27 -08:00
|
|
|
}
|
mm: zswap: fix double invalidate with exclusive loads
If exclusive loads are enabled for zswap, we invalidate the entry before
returning from zswap_frontswap_load(), after dropping the local reference.
However, the tree lock is dropped during decompression after the local
reference is acquired, so the entry could be invalidated before we drop
the local ref. If this happens, the entry is freed once we drop the local
ref, and zswap_invalidate_entry() tries to invalidate an already freed
entry.
Fix this by:
(a) Making sure zswap_invalidate_entry() is always called with a local
ref held, to avoid being called on a freed entry.
(b) Making sure zswap_invalidate_entry() only drops the ref if the entry
was actually on the rbtree. Otherwise, another invalidation could
have already happened, and the initial ref is already dropped.
With these changes, there is no need to check that there is no need to
make sure the entry still exists in the tree in zswap_reclaim_entry()
before invalidating it, as zswap_reclaim_entry() will make this check
internally.
Link: https://lkml.kernel.org/r/20230621093009.637544-1-yosryahmed@google.com
Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-21 09:30:09 +00:00
|
|
|
return false;
|
2013-11-12 15:08:27 -08:00
|
|
|
}
|
|
|
|
|
2023-06-20 19:46:44 +00:00
|
|
|
static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
int i = 0;
|
|
|
|
|
|
|
|
if (ZSWAP_NR_ZPOOLS > 1)
|
|
|
|
i = hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS));
|
|
|
|
|
|
|
|
return entry->pool->zpools[i];
|
|
|
|
}
|
|
|
|
|
2013-11-12 15:08:27 -08:00
|
|
|
/*
|
2014-08-06 16:08:40 -07:00
|
|
|
* Carries out the common pattern of freeing and entry's zpool allocation,
|
2013-11-12 15:08:27 -08:00
|
|
|
* freeing the entry itself, and decrementing the number of stored pages.
|
|
|
|
*/
|
2024-01-29 20:36:37 -05:00
|
|
|
static void zswap_entry_free(struct zswap_entry *entry)
|
2013-11-12 15:08:27 -08:00
|
|
|
{
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
if (!entry->length)
|
|
|
|
atomic_dec(&zswap_same_filled_pages);
|
|
|
|
else {
|
2023-11-30 11:40:20 -08:00
|
|
|
zswap_lru_del(&entry->pool->list_lru, entry);
|
2023-06-20 19:46:44 +00:00
|
|
|
zpool_free(zswap_find_zpool(entry), entry->handle);
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
atomic_dec(&entry->pool->nr_stored);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
zswap_pool_put(entry->pool);
|
|
|
|
}
|
2024-01-29 20:34:38 -05:00
|
|
|
if (entry->objcg) {
|
|
|
|
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
|
|
|
|
obj_cgroup_put(entry->objcg);
|
|
|
|
}
|
2013-11-12 15:08:27 -08:00
|
|
|
zswap_entry_cache_free(entry);
|
|
|
|
atomic_dec(&zswap_stored_pages);
|
2015-09-09 15:35:19 -07:00
|
|
|
zswap_update_total_size();
|
2013-11-12 15:08:27 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* caller must hold the tree lock */
|
|
|
|
static void zswap_entry_get(struct zswap_entry *entry)
|
|
|
|
{
|
2024-01-29 20:36:40 -05:00
|
|
|
WARN_ON_ONCE(!entry->refcount);
|
2013-11-12 15:08:27 -08:00
|
|
|
entry->refcount++;
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:41 -05:00
|
|
|
/* caller must hold the tree lock */
|
2024-01-25 08:14:23 +00:00
|
|
|
static void zswap_entry_put(struct zswap_entry *entry)
|
2013-11-12 15:08:27 -08:00
|
|
|
{
|
2024-01-29 20:36:41 -05:00
|
|
|
WARN_ON_ONCE(!entry->refcount);
|
|
|
|
if (--entry->refcount == 0) {
|
2023-07-27 12:22:24 -04:00
|
|
|
WARN_ON_ONCE(!RB_EMPTY_NODE(&entry->rbnode));
|
2024-01-29 20:36:37 -05:00
|
|
|
zswap_entry_free(entry);
|
2013-11-12 15:08:27 -08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:39 -05:00
|
|
|
/*
|
|
|
|
* If the entry is still valid in the tree, drop the initial ref and remove it
|
|
|
|
* from the tree. This function must be called with an additional ref held,
|
|
|
|
* otherwise it may race with another invalidation freeing the entry.
|
|
|
|
*/
|
|
|
|
static void zswap_invalidate_entry(struct zswap_tree *tree,
|
|
|
|
struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
if (zswap_rb_erase(&tree->rbroot, entry))
|
|
|
|
zswap_entry_put(entry);
|
|
|
|
}
|
|
|
|
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
/*********************************
|
|
|
|
* shrinker functions
|
|
|
|
**********************************/
|
|
|
|
static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
|
|
|
|
spinlock_t *lock, void *arg);
|
|
|
|
|
|
|
|
static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
|
|
|
struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
|
|
|
|
unsigned long shrink_ret, nr_protected, lru_size;
|
|
|
|
struct zswap_pool *pool = shrinker->private_data;
|
|
|
|
bool encountered_page_in_swapcache = false;
|
|
|
|
|
zswap: memcontrol: implement zswap writeback disabling
During our experiment with zswap, we sometimes observe swap IOs due to
occasional zswap store failures and writebacks-to-swap. These swapping
IOs prevent many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not
writeback to backing swapping device when a zswap store attempt fail, and
do not write pages in the zswap pool back to the backing swap device (both
when the pool is full, and when the new zswap shrinker is called).
This new behavior can be opted-in/out on a per-cgroup basis via a new
cgroup file. By default, writebacks to swap device is enabled, which is
the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of its
parent.
Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
as it also disables the zswap shrinker, a major source of zswap
writebacks.
For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who are
using servers with really slow disk performance (for instance, massive
but slow HDDs). For these folks, it's impossible to convince them to
even entertain zswap if swapping also comes as a packaged deal.
Writeback disabling is quite a useful feature in these situations - on
a mixed workloads deployment, they can disable writeback for the more
IO-sensitive workloads, and enable writeback for other background
workloads.
For instance, on a server with HDD, I allocate memories and populate
them with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memories and just read through it a couple of times
(doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out
zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Heidelberg <david@ixit.cz>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07 11:24:06 -08:00
|
|
|
if (!zswap_shrinker_enabled ||
|
|
|
|
!mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
sc->nr_scanned = 0;
|
|
|
|
return SHRINK_STOP;
|
|
|
|
}
|
|
|
|
|
|
|
|
nr_protected =
|
|
|
|
atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
|
|
|
|
lru_size = list_lru_shrink_count(&pool->list_lru, sc);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Abort if we are shrinking into the protected region.
|
|
|
|
*
|
|
|
|
* This short-circuiting is necessary because if we have too many multiple
|
|
|
|
* concurrent reclaimers getting the freeable zswap object counts at the
|
|
|
|
* same time (before any of them made reasonable progress), the total
|
|
|
|
* number of reclaimed objects might be more than the number of unprotected
|
|
|
|
* objects (i.e the reclaimers will reclaim into the protected area of the
|
|
|
|
* zswap LRU).
|
|
|
|
*/
|
|
|
|
if (nr_protected >= lru_size - sc->nr_to_scan) {
|
|
|
|
sc->nr_scanned = 0;
|
|
|
|
return SHRINK_STOP;
|
|
|
|
}
|
|
|
|
|
|
|
|
shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
|
|
|
|
&encountered_page_in_swapcache);
|
|
|
|
|
|
|
|
if (encountered_page_in_swapcache)
|
|
|
|
return SHRINK_STOP;
|
|
|
|
|
|
|
|
return shrink_ret ? shrink_ret : SHRINK_STOP;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool = shrinker->private_data;
|
|
|
|
struct mem_cgroup *memcg = sc->memcg;
|
|
|
|
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
|
|
|
|
unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
|
|
|
|
|
zswap: memcontrol: implement zswap writeback disabling
During our experiment with zswap, we sometimes observe swap IOs due to
occasional zswap store failures and writebacks-to-swap. These swapping
IOs prevent many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not
writeback to backing swapping device when a zswap store attempt fail, and
do not write pages in the zswap pool back to the backing swap device (both
when the pool is full, and when the new zswap shrinker is called).
This new behavior can be opted-in/out on a per-cgroup basis via a new
cgroup file. By default, writebacks to swap device is enabled, which is
the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of its
parent.
Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
as it also disables the zswap shrinker, a major source of zswap
writebacks.
For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who are
using servers with really slow disk performance (for instance, massive
but slow HDDs). For these folks, it's impossible to convince them to
even entertain zswap if swapping also comes as a packaged deal.
Writeback disabling is quite a useful feature in these situations - on
a mixed workloads deployment, they can disable writeback for the more
IO-sensitive workloads, and enable writeback for other background
workloads.
For instance, on a server with HDD, I allocate memories and populate
them with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memories and just read through it a couple of times
(doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out
zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Heidelberg <david@ixit.cz>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07 11:24:06 -08:00
|
|
|
if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
mm: memcg: restore subtree stats flushing
Stats flushing for memcg currently follows the following rules:
- Always flush the entire memcg hierarchy (i.e. flush the root).
- Only one flusher is allowed at a time. If someone else tries to flush
concurrently, they skip and return immediately.
- A periodic flusher flushes all the stats every 2 seconds.
The reason this approach is followed is because all flushes are serialized
by a global rstat spinlock. On the memcg side, flushing is invoked from
userspace reads as well as in-kernel flushers (e.g. reclaim, refault,
etc). This approach aims to avoid serializing all flushers on the global
lock, which can cause a significant performance hit under high
concurrency.
This approach has the following problems:
- Occasionally a userspace read of the stats of a non-root cgroup will
be too expensive as it has to flush the entire hierarchy [1].
- Sometimes the stats accuracy are compromised if there is an ongoing
flush, and we skip and return before the subtree of interest is
actually flushed, yielding stale stats (by up to 2s due to periodic
flushing). This is more visible when reading stats from userspace,
but can also affect in-kernel flushers.
The latter problem is particulary a concern when userspace reads stats
after an event occurs, but gets stats from before the event. Examples:
- When memory usage / pressure spikes, a userspace OOM handler may look
at the stats of different memcgs to select a victim based on various
heuristics (e.g. how much private memory will be freed by killing
this). Reading stale stats from before the usage spike in this case
may cause a wrongful OOM kill.
- A proactive reclaimer may read the stats after writing to
memory.reclaim to measure the success of the reclaim operation. Stale
stats from before reclaim may give a false negative.
- Reading the stats of a parent and a child memcg may be inconsistent
(child larger than parent), if the flush doesn't happen when the
parent is read, but happens when the child is read.
As for in-kernel flushers, they will occasionally get stale stats. No
regressions are currently known from this, but if there are regressions,
they would be very difficult to debug and link to the source of the
problem.
This patch aims to fix these problems by restoring subtree flushing, and
removing the unified/coalesced flushing logic that skips flushing if there
is an ongoing flush. This change would introduce a significant regression
with global stats flushing thresholds. With per-memcg stats flushing
thresholds, this seems to perform really well. The thresholds protect the
underlying lock from unnecessary contention.
This patch was tested in two ways to ensure the latency of flushing is
up to par, on a machine with 384 cpus:
- A synthetic test with 5000 concurrent workers in 500 cgroups doing
allocations and reclaim, as well as 1000 readers for memory.stat
(variation of [2]). No regressions were noticed in the total runtime.
Note that significant regressions in this test are observed with
global stats thresholds, but not with per-memcg thresholds.
- A synthetic stress test for concurrently reading memcg stats while
memory allocation/freeing workers are running in the background,
provided by Wei Xu [3]. With 250k threads reading the stats every
100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01%
of reads take more than 1ms, and no reads take more than 100ms.
[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
[akpm@linux-foundation.org: fix mm/zswap.c]
[yosryahmed@google.com: remove stats flushing mutex]
Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-29 03:21:53 +00:00
|
|
|
mem_cgroup_flush_stats(memcg);
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
|
|
|
|
nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
|
|
|
|
#else
|
|
|
|
/* use pool stats instead of memcg stats */
|
|
|
|
nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
|
|
|
|
nr_stored = atomic_read(&pool->nr_stored);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
if (!nr_stored)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
nr_protected =
|
|
|
|
atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
|
|
|
|
nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
|
|
|
|
/*
|
|
|
|
* Subtract the lru size by an estimate of the number of pages
|
|
|
|
* that should be protected.
|
|
|
|
*/
|
|
|
|
nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Scale the number of freeable pages by the memory saving factor.
|
|
|
|
* This ensures that the better zswap compresses memory, the fewer
|
|
|
|
* pages we will evict to swap (as it will otherwise incur IO for
|
|
|
|
* relatively small memory saving).
|
|
|
|
*/
|
|
|
|
return mult_frac(nr_freeable, nr_backing, nr_stored);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_alloc_shrinker(struct zswap_pool *pool)
|
|
|
|
{
|
|
|
|
pool->shrinker =
|
|
|
|
shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
|
|
|
|
if (!pool->shrinker)
|
|
|
|
return;
|
|
|
|
|
|
|
|
pool->shrinker->private_data = pool;
|
|
|
|
pool->shrinker->scan_objects = zswap_shrinker_scan;
|
|
|
|
pool->shrinker->count_objects = zswap_shrinker_count;
|
|
|
|
pool->shrinker->batch = 0;
|
|
|
|
pool->shrinker->seeks = DEFAULT_SEEKS;
|
|
|
|
}
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/*********************************
|
|
|
|
* per-cpu code
|
|
|
|
**********************************/
|
2016-11-27 00:13:40 +01:00
|
|
|
static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
|
2015-09-09 15:35:19 -07:00
|
|
|
{
|
2016-11-27 00:13:40 +01:00
|
|
|
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
|
2020-12-14 19:14:18 -08:00
|
|
|
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
|
|
|
|
struct crypto_acomp *acomp;
|
|
|
|
struct acomp_req *req;
|
2023-12-28 09:45:46 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
mutex_init(&acomp_ctx->mutex);
|
|
|
|
|
|
|
|
acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
|
|
|
|
if (!acomp_ctx->buffer)
|
|
|
|
return -ENOMEM;
|
2020-12-14 19:14:18 -08:00
|
|
|
|
|
|
|
acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
|
|
|
|
if (IS_ERR(acomp)) {
|
|
|
|
pr_err("could not alloc crypto acomp %s : %ld\n",
|
|
|
|
pool->tfm_name, PTR_ERR(acomp));
|
2023-12-28 09:45:46 +00:00
|
|
|
ret = PTR_ERR(acomp);
|
|
|
|
goto acomp_fail;
|
2020-12-14 19:14:18 -08:00
|
|
|
}
|
|
|
|
acomp_ctx->acomp = acomp;
|
2015-09-09 15:35:19 -07:00
|
|
|
|
2020-12-14 19:14:18 -08:00
|
|
|
req = acomp_request_alloc(acomp_ctx->acomp);
|
|
|
|
if (!req) {
|
|
|
|
pr_err("could not alloc crypto acomp_request %s\n",
|
|
|
|
pool->tfm_name);
|
2023-12-28 09:45:46 +00:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto req_fail;
|
2016-11-27 00:13:40 +01:00
|
|
|
}
|
2020-12-14 19:14:18 -08:00
|
|
|
acomp_ctx->req = req;
|
|
|
|
|
|
|
|
crypto_init_wait(&acomp_ctx->wait);
|
|
|
|
/*
|
|
|
|
* if the backend of acomp is async zip, crypto_req_done() will wakeup
|
|
|
|
* crypto_wait_req(); if the backend of acomp is scomp, the callback
|
|
|
|
* won't be called, crypto_wait_req() will return without blocking.
|
|
|
|
*/
|
|
|
|
acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
|
|
|
|
crypto_req_done, &acomp_ctx->wait);
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
return 0;
|
2023-12-28 09:45:46 +00:00
|
|
|
|
|
|
|
req_fail:
|
|
|
|
crypto_free_acomp(acomp_ctx->acomp);
|
|
|
|
acomp_fail:
|
|
|
|
kfree(acomp_ctx->buffer);
|
|
|
|
return ret;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
|
|
|
|
2016-11-27 00:13:40 +01:00
|
|
|
static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
|
2015-09-09 15:35:19 -07:00
|
|
|
{
|
2016-11-27 00:13:40 +01:00
|
|
|
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
|
2020-12-14 19:14:18 -08:00
|
|
|
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
|
|
|
|
|
|
|
|
if (!IS_ERR_OR_NULL(acomp_ctx)) {
|
|
|
|
if (!IS_ERR_OR_NULL(acomp_ctx->req))
|
|
|
|
acomp_request_free(acomp_ctx->req);
|
|
|
|
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
|
|
|
|
crypto_free_acomp(acomp_ctx->acomp);
|
2023-12-28 09:45:46 +00:00
|
|
|
kfree(acomp_ctx->buffer);
|
2020-12-14 19:14:18 -08:00
|
|
|
}
|
2015-09-09 15:35:19 -07:00
|
|
|
|
2016-11-27 00:13:40 +01:00
|
|
|
return 0;
|
2015-09-09 15:35:19 -07:00
|
|
|
}
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
|
|
|
|
spinlock_t *lock, void *arg)
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
{
|
2023-11-30 11:40:20 -08:00
|
|
|
struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
bool *encountered_page_in_swapcache = (bool *)arg;
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
swp_entry_t swpentry;
|
2023-11-30 11:40:20 -08:00
|
|
|
enum lru_status ret = LRU_REMOVED_RETRY;
|
|
|
|
int writeback_result;
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
/*
|
|
|
|
* Rotate the entry to the tail before unlocking the LRU,
|
|
|
|
* so that in case of an invalidation race concurrent
|
|
|
|
* reclaimers don't waste their time on it.
|
|
|
|
*
|
|
|
|
* If writeback succeeds, or failure is due to the entry
|
|
|
|
* being invalidated by the swap subsystem, the invalidation
|
|
|
|
* will unlink and free it.
|
|
|
|
*
|
|
|
|
* Temporary failures, where the same entry should be tried
|
|
|
|
* again immediately, almost never happen for this shrinker.
|
|
|
|
* We don't do any trylocking; -ENOMEM comes closest,
|
|
|
|
* but that's extremely rare and doesn't happen spuriously
|
|
|
|
* either. Don't bother distinguishing this case.
|
|
|
|
*
|
|
|
|
* But since they do exist in theory, the entry cannot just
|
|
|
|
* be unlinked, or we could leak it. Hence, rotate.
|
|
|
|
*/
|
|
|
|
list_move_tail(item, &l->list);
|
|
|
|
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
/*
|
|
|
|
* Once the lru lock is dropped, the entry might get freed. The
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
* swpentry is copied to the stack, and entry isn't deref'd again
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
* until the entry is verified to still be alive in the tree.
|
|
|
|
*/
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
swpentry = entry->swpentry;
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
/*
|
|
|
|
* It's safe to drop the lock here because we return either
|
|
|
|
* LRU_REMOVED_RETRY or LRU_RETRY.
|
|
|
|
*/
|
|
|
|
spin_unlock(lock);
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
writeback_result = zswap_writeback_entry(entry, swpentry);
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
if (writeback_result) {
|
|
|
|
zswap_reject_reclaim_fail++;
|
|
|
|
ret = LRU_RETRY;
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Encountering a page already in swap cache is a sign that we are shrinking
|
|
|
|
* into the warmer region. We should terminate shrinking (if we're in the dynamic
|
|
|
|
* shrinker context).
|
|
|
|
*/
|
2024-01-28 13:28:49 +00:00
|
|
|
if (writeback_result == -EEXIST && encountered_page_in_swapcache)
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
*encountered_page_in_swapcache = true;
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
} else {
|
|
|
|
zswap_written_back_pages++;
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
}
|
2023-11-30 11:40:21 -08:00
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
spin_lock(lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int shrink_memcg(struct mem_cgroup *memcg)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
int nid, shrunk = 0;
|
|
|
|
|
zswap: memcontrol: implement zswap writeback disabling
During our experiment with zswap, we sometimes observe swap IOs due to
occasional zswap store failures and writebacks-to-swap. These swapping
IOs prevent many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not
writeback to backing swapping device when a zswap store attempt fail, and
do not write pages in the zswap pool back to the backing swap device (both
when the pool is full, and when the new zswap shrinker is called).
This new behavior can be opted-in/out on a per-cgroup basis via a new
cgroup file. By default, writebacks to swap device is enabled, which is
the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of its
parent.
Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
as it also disables the zswap shrinker, a major source of zswap
writebacks.
For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who are
using servers with really slow disk performance (for instance, massive
but slow HDDs). For these folks, it's impossible to convince them to
even entertain zswap if swapping also comes as a packaged deal.
Writeback disabling is quite a useful feature in these situations - on
a mixed workloads deployment, they can disable writeback for the more
IO-sensitive workloads, and enable writeback for other background
workloads.
For instance, on a server with HDD, I allocate memories and populate
them with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memories and just read through it a couple of times
(doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out
zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Heidelberg <david@ixit.cz>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07 11:24:06 -08:00
|
|
|
if (!mem_cgroup_zswap_writeback_enabled(memcg))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
/*
|
|
|
|
* Skip zombies because their LRUs are reparented and we would be
|
|
|
|
* reclaiming from the parent instead of the dead memcg.
|
|
|
|
*/
|
|
|
|
if (memcg && !mem_cgroup_online(memcg))
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
pool = zswap_pool_current_get();
|
|
|
|
if (!pool)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
for_each_node_state(nid, N_NORMAL_MEMORY) {
|
|
|
|
unsigned long nr_to_walk = 1;
|
|
|
|
|
|
|
|
shrunk += list_lru_walk_one(&pool->list_lru, nid, memcg,
|
|
|
|
&shrink_memcg_cb, NULL, &nr_to_walk);
|
|
|
|
}
|
|
|
|
zswap_pool_put(pool);
|
|
|
|
return shrunk ? 0 : -EAGAIN;
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
}
|
|
|
|
|
2020-01-30 22:15:04 -08:00
|
|
|
static void shrink_worker(struct work_struct *w)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool = container_of(w, typeof(*pool),
|
|
|
|
shrink_work);
|
2023-11-30 11:40:20 -08:00
|
|
|
struct mem_cgroup *memcg;
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
int ret, failures = 0;
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
/* global reclaim will select cgroup in a round-robin fashion. */
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
do {
|
2023-11-30 11:40:20 -08:00
|
|
|
spin_lock(&zswap_pools_lock);
|
|
|
|
pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
|
|
|
|
memcg = pool->next_shrink;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to retry if we have gone through a full round trip, or if we
|
|
|
|
* got an offline memcg (or else we risk undoing the effect of the
|
|
|
|
* zswap memcg offlining cleanup callback). This is not catastrophic
|
|
|
|
* per se, but it will keep the now offlined memcg hostage for a while.
|
|
|
|
*
|
|
|
|
* Note that if we got an online memcg, we will keep the extra
|
|
|
|
* reference in case the original reference obtained by mem_cgroup_iter
|
|
|
|
* is dropped by the zswap memcg offlining callback, ensuring that the
|
|
|
|
* memcg is not killed when we are reclaiming.
|
|
|
|
*/
|
|
|
|
if (!memcg) {
|
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
if (++failures == MAX_RECLAIM_RETRIES)
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
break;
|
2023-11-30 11:40:20 -08:00
|
|
|
|
|
|
|
goto resched;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!mem_cgroup_tryget_online(memcg)) {
|
|
|
|
/* drop the reference from mem_cgroup_iter() */
|
|
|
|
mem_cgroup_iter_break(NULL, memcg);
|
|
|
|
pool->next_shrink = NULL;
|
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
if (++failures == MAX_RECLAIM_RETRIES)
|
|
|
|
break;
|
2023-11-30 11:40:20 -08:00
|
|
|
|
|
|
|
goto resched;
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
}
|
2023-11-30 11:40:20 -08:00
|
|
|
spin_unlock(&zswap_pools_lock);
|
|
|
|
|
|
|
|
ret = shrink_memcg(memcg);
|
|
|
|
/* drop the extra reference */
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
|
|
|
|
if (ret == -EINVAL)
|
|
|
|
break;
|
|
|
|
if (ret && ++failures == MAX_RECLAIM_RETRIES)
|
|
|
|
break;
|
|
|
|
|
|
|
|
resched:
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
cond_resched();
|
|
|
|
} while (!zswap_can_accept());
|
2020-01-30 22:15:04 -08:00
|
|
|
zswap_pool_put(pool);
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:43 -05:00
|
|
|
static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
struct crypto_acomp_ctx *acomp_ctx;
|
|
|
|
struct scatterlist input, output;
|
|
|
|
unsigned int dlen = PAGE_SIZE;
|
|
|
|
unsigned long handle;
|
|
|
|
struct zpool *zpool;
|
|
|
|
char *buf;
|
|
|
|
gfp_t gfp;
|
|
|
|
int ret;
|
|
|
|
u8 *dst;
|
|
|
|
|
|
|
|
acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
|
|
|
|
|
|
|
|
mutex_lock(&acomp_ctx->mutex);
|
|
|
|
|
|
|
|
dst = acomp_ctx->buffer;
|
|
|
|
sg_init_table(&input, 1);
|
|
|
|
sg_set_page(&input, &folio->page, PAGE_SIZE, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need PAGE_SIZE * 2 here since there maybe over-compression case,
|
|
|
|
* and hardware-accelerators may won't check the dst buffer size, so
|
|
|
|
* giving the dst buffer with enough length to avoid buffer overflow.
|
|
|
|
*/
|
|
|
|
sg_init_one(&output, dst, PAGE_SIZE * 2);
|
|
|
|
acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* it maybe looks a little bit silly that we send an asynchronous request,
|
|
|
|
* then wait for its completion synchronously. This makes the process look
|
|
|
|
* synchronous in fact.
|
|
|
|
* Theoretically, acomp supports users send multiple acomp requests in one
|
|
|
|
* acomp instance, then get those requests done simultaneously. but in this
|
|
|
|
* case, zswap actually does store and load page by page, there is no
|
|
|
|
* existing method to send the second page before the first page is done
|
|
|
|
* in one thread doing zwap.
|
|
|
|
* but in different threads running on different cpu, we have different
|
|
|
|
* acomp instance, so multiple threads can do (de)compression in parallel.
|
|
|
|
*/
|
|
|
|
ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
|
|
|
|
dlen = acomp_ctx->req->dlen;
|
|
|
|
if (ret) {
|
|
|
|
zswap_reject_compress_fail++;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
zpool = zswap_find_zpool(entry);
|
|
|
|
gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
|
|
|
|
if (zpool_malloc_support_movable(zpool))
|
|
|
|
gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
|
|
|
|
ret = zpool_malloc(zpool, dlen, gfp, &handle);
|
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
zswap_reject_compress_poor++;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
if (ret) {
|
|
|
|
zswap_reject_alloc_fail++;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
|
|
|
|
memcpy(buf, dst, dlen);
|
|
|
|
zpool_unmap_handle(zpool, handle);
|
|
|
|
|
|
|
|
entry->handle = handle;
|
|
|
|
entry->length = dlen;
|
|
|
|
|
|
|
|
unlock:
|
|
|
|
mutex_unlock(&acomp_ctx->mutex);
|
|
|
|
return ret == 0;
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:42 -05:00
|
|
|
static void zswap_decompress(struct zswap_entry *entry, struct page *page)
|
2023-12-28 09:45:43 +00:00
|
|
|
{
|
|
|
|
struct zpool *zpool = zswap_find_zpool(entry);
|
|
|
|
struct scatterlist input, output;
|
|
|
|
struct crypto_acomp_ctx *acomp_ctx;
|
|
|
|
u8 *src;
|
|
|
|
|
|
|
|
acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
|
2023-12-28 09:45:46 +00:00
|
|
|
mutex_lock(&acomp_ctx->mutex);
|
2023-12-28 09:45:43 +00:00
|
|
|
|
|
|
|
src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
|
|
|
|
if (!zpool_can_sleep_mapped(zpool)) {
|
2023-12-28 09:45:46 +00:00
|
|
|
memcpy(acomp_ctx->buffer, src, entry->length);
|
|
|
|
src = acomp_ctx->buffer;
|
2023-12-28 09:45:43 +00:00
|
|
|
zpool_unmap_handle(zpool, entry->handle);
|
|
|
|
}
|
|
|
|
|
|
|
|
sg_init_one(&input, src, entry->length);
|
|
|
|
sg_init_table(&output, 1);
|
|
|
|
sg_set_page(&output, page, PAGE_SIZE, 0);
|
|
|
|
acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
|
|
|
|
BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
|
|
|
|
BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
|
2023-12-28 09:45:46 +00:00
|
|
|
mutex_unlock(&acomp_ctx->mutex);
|
2023-12-28 09:45:43 +00:00
|
|
|
|
|
|
|
if (zpool_can_sleep_mapped(zpool))
|
|
|
|
zpool_unmap_handle(zpool, entry->handle);
|
|
|
|
}
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/*********************************
|
|
|
|
* writeback code
|
|
|
|
**********************************/
|
|
|
|
/*
|
2023-12-13 21:58:30 +00:00
|
|
|
* Attempts to free an entry by adding a folio to the swap cache,
|
|
|
|
* decompressing the entry data into the folio, and issuing a
|
|
|
|
* bio write to write the folio back to the swap device.
|
2013-07-10 16:05:03 -07:00
|
|
|
*
|
2023-12-13 21:58:30 +00:00
|
|
|
* This can be thought of as a "resumed writeback" of the folio
|
2013-07-10 16:05:03 -07:00
|
|
|
* to the swap device. We are basically resuming the same swap
|
2023-07-17 12:02:27 -04:00
|
|
|
* writeback path that was intercepted with the zswap_store()
|
2023-12-13 21:58:30 +00:00
|
|
|
* in the first place. After the folio has been decompressed into
|
2013-07-10 16:05:03 -07:00
|
|
|
* the swap cache, the compressed version stored by zswap can be
|
|
|
|
* freed.
|
|
|
|
*/
|
2023-06-12 11:38:15 +02:00
|
|
|
static int zswap_writeback_entry(struct zswap_entry *entry,
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
swp_entry_t swpentry)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
struct zswap_tree *tree;
|
2023-12-13 21:58:30 +00:00
|
|
|
struct folio *folio;
|
mempolicy: alloc_pages_mpol() for NUMA policy without vma
Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
principal actor for passing mempolicy choice down to __alloc_pages(),
rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
vma_alloc_folio() and alloc_pages() remain, but as wrappers around
alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the
additional args to policy_nodemask(), which subsumes policy_node().
Cleanup throughout, cutting out some unhelpful "helpers".
It would all be much simpler without MPOL_INTERLEAVE, but that adds a
dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8bfd
("tmpfs: distribute interleave better across nodes"), which added ino bias
to the interleave, hidden from mm/mempolicy.c until this commit.
Hence "ilx" throughout, the "interleave index". Originally I thought it
could be done just with nid, but that's wrong: the nodemask may come from
the shared policy layer below a shmem vma, or it may come from the task
layer above a shmem vma; and without the final nodemask then nodeid cannot
be decided. And how ilx is applied depends also on page order.
The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
passed down from vma-less alloc_pages() is also used as hint not to use
THP-style hugepage allocation - to avoid the overhead of a hugepage arg
(though I don't understand why we never just added a GFP bit for THP - if
it actually needs a different allocation strategy from other pages of the
same order). vma_alloc_folio() still carries its hugepage arg here, but
it is not used, and should be removed when agreed.
get_vma_policy() no longer allows a NULL vma: over time I believe we've
eradicated all the places which used to need it e.g. swapoff and madvise
used to pass NULL vma to read_swap_cache_async(), but now know the vma.
[hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()]
Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com
Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com
Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun heo <tj@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Domenico Cerasuolo <mimmocerasuolo@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-19 13:39:08 -07:00
|
|
|
struct mempolicy *mpol;
|
2023-12-13 21:58:30 +00:00
|
|
|
bool folio_was_allocated;
|
2013-07-10 16:05:03 -07:00
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = WB_SYNC_NONE,
|
|
|
|
};
|
|
|
|
|
2023-12-13 21:58:30 +00:00
|
|
|
/* try to allocate swap cache folio */
|
mempolicy: alloc_pages_mpol() for NUMA policy without vma
Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
principal actor for passing mempolicy choice down to __alloc_pages(),
rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
vma_alloc_folio() and alloc_pages() remain, but as wrappers around
alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the
additional args to policy_nodemask(), which subsumes policy_node().
Cleanup throughout, cutting out some unhelpful "helpers".
It would all be much simpler without MPOL_INTERLEAVE, but that adds a
dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8bfd
("tmpfs: distribute interleave better across nodes"), which added ino bias
to the interleave, hidden from mm/mempolicy.c until this commit.
Hence "ilx" throughout, the "interleave index". Originally I thought it
could be done just with nid, but that's wrong: the nodemask may come from
the shared policy layer below a shmem vma, or it may come from the task
layer above a shmem vma; and without the final nodemask then nodeid cannot
be decided. And how ilx is applied depends also on page order.
The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
passed down from vma-less alloc_pages() is also used as hint not to use
THP-style hugepage allocation - to avoid the overhead of a hugepage arg
(though I don't understand why we never just added a GFP bit for THP - if
it actually needs a different allocation strategy from other pages of the
same order). vma_alloc_folio() still carries its hugepage arg here, but
it is not used, and should be removed when agreed.
get_vma_policy() no longer allows a NULL vma: over time I believe we've
eradicated all the places which used to need it e.g. swapoff and madvise
used to pass NULL vma to read_swap_cache_async(), but now know the vma.
[hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()]
Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com
Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com
Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun heo <tj@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Domenico Cerasuolo <mimmocerasuolo@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-19 13:39:08 -07:00
|
|
|
mpol = get_task_policy(current);
|
2023-12-13 21:58:30 +00:00
|
|
|
folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
|
|
|
|
NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
|
|
|
|
if (!folio)
|
2023-12-28 09:45:45 +00:00
|
|
|
return -ENOMEM;
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2023-12-28 09:45:45 +00:00
|
|
|
/*
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
* Found an existing folio, we raced with swapin or concurrent
|
|
|
|
* shrinker. We generally writeback cold folios from zswap, and
|
|
|
|
* swapin means the folio just became hot, so skip this folio.
|
|
|
|
* For unlikely concurrent shrinker case, it will be unlinked
|
|
|
|
* and freed when invalidated by the concurrent shrinker anyway.
|
2023-12-28 09:45:45 +00:00
|
|
|
*/
|
2023-12-13 21:58:30 +00:00
|
|
|
if (!folio_was_allocated) {
|
|
|
|
folio_put(folio);
|
2023-12-28 09:45:45 +00:00
|
|
|
return -EEXIST;
|
2023-07-27 12:22:25 -04:00
|
|
|
}
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2023-07-27 12:22:25 -04:00
|
|
|
/*
|
2023-12-13 21:58:30 +00:00
|
|
|
* folio is locked, and the swapcache is now secured against
|
2023-07-27 12:22:25 -04:00
|
|
|
* concurrent swapping to and from the slot. Verify that the
|
|
|
|
* swap entry hasn't been invalidated and recycled behind our
|
|
|
|
* backs (our zswap_entry reference doesn't prevent that), to
|
2023-12-13 21:58:30 +00:00
|
|
|
* avoid overwriting a new swap folio with old compressed data.
|
2023-07-27 12:22:25 -04:00
|
|
|
*/
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
tree = swap_zswap_tree(swpentry);
|
2023-07-27 12:22:25 -04:00
|
|
|
spin_lock(&tree->lock);
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
|
mm: fix zswap writeback race condition
The zswap writeback mechanism can cause a race condition resulting in
memory corruption, where a swapped out page gets swapped in with data that
was written to a different page.
The race unfolds like this:
1. a page with data A and swap offset X is stored in zswap
2. page A is removed off the LRU by zpool driver for writeback in
zswap-shrink work, data for A is mapped by zpool driver
3. user space program faults and invalidates page entry A, offset X is
considered free
4. kswapd stores page B at offset X in zswap (zswap could also be
full, if so, page B would then be IOed to X, then skip step 5.)
5. entry A is replaced by B in tree->rbroot, this doesn't affect the
local reference held by zswap-shrink work
6. zswap-shrink work writes back A at X, and frees zswap entry A
7. swapin of slot X brings A in memory instead of B
The fix:
Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW),
zswap-shrink work just checks that the local zswap_entry reference is
still the same as the one in the tree. If it's not the same it means that
it's either been invalidated or replaced, in both cases the writeback is
aborted because the local entry contains stale data.
Reproducer:
I originally found this by running `stress` overnight to validate my work
on the zswap writeback mechanism, it manifested after hours on my test
machine. The key to make it happen is having zswap writebacks, so
whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do
the trick.
In order to reproduce this faster on a vm, I setup a system with ~100M of
available memory and a 500M swap file, then running `stress --vm 1
--vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens
of minutes. One can speed things up even more by swinging
/sys/module/zswap/parameters/max_pool_percent up and down between, say, 20
and 1; this makes it reproduce in tens of seconds. It's crucial to set
`--vm-stride` to something other than 4096 otherwise `stress` won't
realize that memory has been corrupted because all pages would have the
same data.
Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 17:12:00 +02:00
|
|
|
spin_unlock(&tree->lock);
|
2023-12-13 21:58:30 +00:00
|
|
|
delete_from_swap_cache(folio);
|
2024-01-25 08:51:27 +00:00
|
|
|
folio_unlock(folio);
|
|
|
|
folio_put(folio);
|
2023-12-28 09:45:45 +00:00
|
|
|
return -ENOMEM;
|
2023-07-27 12:22:25 -04:00
|
|
|
}
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
|
|
|
|
/* Safe to deref entry after the entry is verified above. */
|
|
|
|
zswap_entry_get(entry);
|
2023-07-27 12:22:25 -04:00
|
|
|
spin_unlock(&tree->lock);
|
mm: fix zswap writeback race condition
The zswap writeback mechanism can cause a race condition resulting in
memory corruption, where a swapped out page gets swapped in with data that
was written to a different page.
The race unfolds like this:
1. a page with data A and swap offset X is stored in zswap
2. page A is removed off the LRU by zpool driver for writeback in
zswap-shrink work, data for A is mapped by zpool driver
3. user space program faults and invalidates page entry A, offset X is
considered free
4. kswapd stores page B at offset X in zswap (zswap could also be
full, if so, page B would then be IOed to X, then skip step 5.)
5. entry A is replaced by B in tree->rbroot, this doesn't affect the
local reference held by zswap-shrink work
6. zswap-shrink work writes back A at X, and frees zswap entry A
7. swapin of slot X brings A in memory instead of B
The fix:
Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW),
zswap-shrink work just checks that the local zswap_entry reference is
still the same as the one in the tree. If it's not the same it means that
it's either been invalidated or replaced, in both cases the writeback is
aborted because the local entry contains stale data.
Reproducer:
I originally found this by running `stress` overnight to validate my work
on the zswap writeback mechanism, it manifested after hours on my test
machine. The key to make it happen is having zswap writebacks, so
whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do
the trick.
In order to reproduce this faster on a vm, I setup a system with ~100M of
available memory and a 500M swap file, then running `stress --vm 1
--vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens
of minutes. One can speed things up even more by swinging
/sys/module/zswap/parameters/max_pool_percent up and down between, say, 20
and 1; this makes it reproduce in tens of seconds. It's crucial to set
`--vm-stride` to something other than 4096 otherwise `stress` won't
realize that memory has been corrupted because all pages would have the
same data.
Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-03 17:12:00 +02:00
|
|
|
|
2024-01-29 20:36:42 -05:00
|
|
|
zswap_decompress(entry, &folio->page);
|
2023-07-27 12:22:25 -04:00
|
|
|
|
mm/zswap: fix race between lru writeback and swapoff
LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
CPU1 CPU2
shrink_memcg_cb swap_off
list_lru_isolate zswap_invalidate
zswap_swapoff
kfree(tree)
// UAF
spin_lock(&tree->lock)
The problem is that the entry in lru list can't protect the tree from
being swapoff and freed, and the entry also can be invalidated and freed
concurrently after we unlock the lru lock.
We can fix it by moving the swap cache allocation ahead before referencing
the tree, then check invalidate race with tree lock, only after that we
can safely deref the entry. Note we couldn't deref entry or tree anymore
after we unlock the folio, since we depend on this to hold on swapoff.
So this patch moves all tree and entry usage to zswap_writeback_entry(),
we only use the copied swpentry on the stack to allocate swap cache and if
returned with folio locked we can reference the tree safely. Then we can
check invalidate race with tree lock, the following things is much the
same like zswap_load().
Since we can't deref the entry after zswap_writeback_entry(), we can't use
zswap_lru_putback() anymore, instead we rotate the entry in the beginning.
And it will be unlinked and freed when invalidated if writeback success.
Another change is we don't update the memcg nr_zswap_protected in the
-ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced with
swapin or concurrent shrinker action, since swapin already have memcg
nr_zswap_protected updated, don't need double counts here. For concurrent
shrinker, the folio will be writeback and freed anyway. -ENOMEM case is
extremely rare and doesn't happen spuriously either, so don't bother
distinguishing this case.
[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-2-b10479847099@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chris Li <chriscli@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-28 13:28:50 +00:00
|
|
|
count_vm_event(ZSWPWB);
|
|
|
|
if (entry->objcg)
|
|
|
|
count_objcg_event(entry->objcg, ZSWPWB);
|
|
|
|
|
|
|
|
spin_lock(&tree->lock);
|
|
|
|
zswap_invalidate_entry(tree, entry);
|
|
|
|
zswap_entry_put(entry);
|
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
|
2023-12-13 21:58:30 +00:00
|
|
|
/* folio is up to date */
|
|
|
|
folio_mark_uptodate(folio);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2013-11-12 15:07:52 -08:00
|
|
|
/* move it to the tail of the inactive list after end_writeback */
|
2023-12-13 21:58:30 +00:00
|
|
|
folio_set_reclaim(folio);
|
2013-11-12 15:07:52 -08:00
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/* start writeback */
|
2023-12-13 21:58:31 +00:00
|
|
|
__swap_writepage(folio, &wbc);
|
2023-12-13 21:58:30 +00:00
|
|
|
folio_put(folio);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2023-12-28 09:45:45 +00:00
|
|
|
return 0;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
|
|
|
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
|
|
|
|
{
|
|
|
|
unsigned long *page;
|
2023-02-06 04:00:36 +09:00
|
|
|
unsigned long val;
|
|
|
|
unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
|
|
|
|
page = (unsigned long *)ptr;
|
2023-02-06 04:00:36 +09:00
|
|
|
val = page[0];
|
|
|
|
|
|
|
|
if (val != page[last_pos])
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
for (pos = 1; pos < last_pos; pos++) {
|
|
|
|
if (val != page[pos])
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
return 0;
|
|
|
|
}
|
2023-02-06 04:00:36 +09:00
|
|
|
|
|
|
|
*value = val;
|
|
|
|
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_fill_page(void *ptr, unsigned long value)
|
|
|
|
{
|
|
|
|
unsigned long *page;
|
|
|
|
|
|
|
|
page = (unsigned long *)ptr;
|
|
|
|
memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
|
|
|
|
}
|
|
|
|
|
2023-07-15 05:23:40 +01:00
|
|
|
bool zswap_store(struct folio *folio)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
2023-08-21 18:08:48 +02:00
|
|
|
swp_entry_t swp = folio->swap;
|
2023-07-17 12:02:27 -04:00
|
|
|
pgoff_t offset = swp_offset(swp);
|
2024-01-19 11:22:23 +00:00
|
|
|
struct zswap_tree *tree = swap_zswap_tree(swp);
|
2013-07-10 16:05:03 -07:00
|
|
|
struct zswap_entry *entry, *dupentry;
|
2022-05-19 14:08:53 -07:00
|
|
|
struct obj_cgroup *objcg = NULL;
|
2023-11-30 11:40:20 -08:00
|
|
|
struct mem_cgroup *memcg = NULL;
|
2024-01-29 20:36:44 -05:00
|
|
|
struct zswap_pool *shrink_pool;
|
2023-07-17 12:02:27 -04:00
|
|
|
|
2023-07-15 05:23:40 +01:00
|
|
|
VM_WARN_ON_ONCE(!folio_test_locked(folio));
|
|
|
|
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2023-07-15 05:23:40 +01:00
|
|
|
/* Large folios aren't supported */
|
|
|
|
if (folio_test_large(folio))
|
2023-07-17 12:02:27 -04:00
|
|
|
return false;
|
mm, swap, frontswap: fix THP swap if frontswap enabled
It was reported by Sergey Senozhatsky that if THP (Transparent Huge
Page) and frontswap (via zswap) are both enabled, when memory goes low
so that swap is triggered, segfault and memory corruption will occur in
random user space applications as follow,
kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
#0 0x00007fc08889ae0d _int_malloc (libc.so.6)
#1 0x00007fc08889c2f3 malloc (libc.so.6)
#2 0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
#3 0x0000560e6005e75c n/a (urxvt)
#4 0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
#5 0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
#6 0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
#7 0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
#8 0x0000560e6005cb55 ev_run (urxvt)
#9 0x0000560e6003b9b9 main (urxvt)
#10 0x00007fc08883af4a __libc_start_main (libc.so.6)
#11 0x0000560e6003f9da _start (urxvt)
After bisection, it was found the first bad commit is bd4c82c22c36 ("mm,
THP, swap: delay splitting THP after swapped out").
The root cause is as follows:
When the pages are written to swap device during swapping out in
swap_writepage(), zswap (fontswap) is tried to compress the pages to
improve performance. But zswap (frontswap) will treat THP as a normal
page, so only the head page is saved. After swapping in, tail pages
will not be restored to their original contents, causing memory
corruption in the applications.
This is fixed by refusing to save page in the frontswap store functions
if the page is a THP. So that the THP will be swapped out to swap
device.
Another choice is to split THP if frontswap is enabled. But it is found
that the frontswap enabling isn't flexible. For example, if
CONFIG_ZSWAP=y (cannot be module), frontswap will be enabled even if
zswap itself isn't enabled.
Frontswap has multiple backends, to make it easy for one backend to
enable THP support, the THP checking is put in backend frontswap store
functions instead of the general interfaces.
Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
Fixes: bd4c82c22c367e068 ("mm, THP, swap: delay splitting THP after swapped out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org> [put THP checking in backend]
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shaohua Li <shli@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: <stable@vger.kernel.org> [4.14]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-21 14:45:39 -08:00
|
|
|
|
2023-09-22 19:22:11 +02:00
|
|
|
/*
|
|
|
|
* If this is a duplicate, it must be removed before attempting to store
|
|
|
|
* it, otherwise, if the store fails the old page won't be removed from
|
|
|
|
* the tree, and it might be written back overriding the new data.
|
|
|
|
*/
|
|
|
|
spin_lock(&tree->lock);
|
2024-01-29 20:36:44 -05:00
|
|
|
entry = zswap_rb_search(&tree->rbroot, offset);
|
|
|
|
if (entry) {
|
|
|
|
zswap_invalidate_entry(tree, entry);
|
2023-09-22 19:22:11 +02:00
|
|
|
zswap_duplicate_entry++;
|
|
|
|
}
|
|
|
|
spin_unlock(&tree->lock);
|
2024-02-08 02:32:54 +00:00
|
|
|
|
|
|
|
if (!zswap_enabled)
|
|
|
|
return false;
|
|
|
|
|
2023-07-15 05:23:41 +01:00
|
|
|
objcg = get_obj_cgroup_from_folio(folio);
|
2023-11-30 11:40:20 -08:00
|
|
|
if (objcg && !obj_cgroup_may_zswap(objcg)) {
|
|
|
|
memcg = get_mem_cgroup_from_objcg(objcg);
|
|
|
|
if (shrink_memcg(memcg)) {
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
goto reject;
|
|
|
|
}
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
}
|
2022-05-19 14:08:53 -07:00
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/* reclaim space if needed */
|
|
|
|
if (zswap_is_full()) {
|
|
|
|
zswap_pool_limit_hit++;
|
2020-01-30 22:15:04 -08:00
|
|
|
zswap_pool_reached_full = true;
|
2022-05-19 14:08:53 -07:00
|
|
|
goto shrink;
|
2020-01-30 22:15:04 -08:00
|
|
|
}
|
2018-07-26 16:37:42 -07:00
|
|
|
|
2020-01-30 22:15:04 -08:00
|
|
|
if (zswap_pool_reached_full) {
|
2023-07-17 12:02:27 -04:00
|
|
|
if (!zswap_can_accept())
|
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below, these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits his limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclaimation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 20:32:27 +02:00
|
|
|
goto shrink;
|
2023-07-17 12:02:27 -04:00
|
|
|
else
|
2020-01-30 22:15:04 -08:00
|
|
|
zswap_pool_reached_full = false;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* allocate entry */
|
2024-01-29 20:36:44 -05:00
|
|
|
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
|
2013-07-10 16:05:03 -07:00
|
|
|
if (!entry) {
|
|
|
|
zswap_reject_kmemcache_fail++;
|
|
|
|
goto reject;
|
|
|
|
}
|
|
|
|
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
if (zswap_same_filled_pages_enabled) {
|
2024-01-29 20:36:44 -05:00
|
|
|
unsigned long value;
|
|
|
|
u8 *src;
|
|
|
|
|
|
|
|
src = kmap_local_folio(folio, 0);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
if (zswap_is_page_same_filled(src, &value)) {
|
2023-11-27 16:55:21 +01:00
|
|
|
kunmap_local(src);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
entry->length = 0;
|
|
|
|
entry->value = value;
|
|
|
|
atomic_inc(&zswap_same_filled_pages);
|
|
|
|
goto insert_entry;
|
|
|
|
}
|
2023-11-27 16:55:21 +01:00
|
|
|
kunmap_local(src);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
}
|
|
|
|
|
2023-07-17 12:02:27 -04:00
|
|
|
if (!zswap_non_same_filled_pages_enabled)
|
2022-03-22 14:47:43 -07:00
|
|
|
goto freepage;
|
|
|
|
|
2015-09-09 15:35:19 -07:00
|
|
|
/* if entry is successfully added, it keeps the reference */
|
|
|
|
entry->pool = zswap_pool_current_get();
|
2023-07-17 12:02:27 -04:00
|
|
|
if (!entry->pool)
|
2015-09-09 15:35:19 -07:00
|
|
|
goto freepage;
|
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
if (objcg) {
|
|
|
|
memcg = get_mem_cgroup_from_objcg(objcg);
|
|
|
|
if (memcg_list_lru_alloc(memcg, &entry->pool->list_lru, GFP_KERNEL)) {
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
goto put_pool;
|
|
|
|
}
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
}
|
|
|
|
|
2024-01-29 20:36:43 -05:00
|
|
|
if (!zswap_compress(folio, entry))
|
|
|
|
goto put_pool;
|
2020-12-14 19:14:18 -08:00
|
|
|
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
insert_entry:
|
2024-01-29 20:36:44 -05:00
|
|
|
entry->swpentry = swp;
|
2022-05-19 14:08:53 -07:00
|
|
|
entry->objcg = objcg;
|
|
|
|
if (objcg) {
|
|
|
|
obj_cgroup_charge_zswap(objcg, entry->length);
|
|
|
|
/* Account before objcg ref is moved to tree */
|
|
|
|
count_objcg_event(objcg, ZSWPOUT);
|
|
|
|
}
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
/* map */
|
|
|
|
spin_lock(&tree->lock);
|
2023-09-22 19:22:11 +02:00
|
|
|
/*
|
|
|
|
* A duplicate entry should have been removed at the beginning of this
|
|
|
|
* function. Since the swap entry should be pinned, if a duplicate is
|
|
|
|
* found again here it means that something went wrong in the swap
|
|
|
|
* cache.
|
|
|
|
*/
|
2023-07-17 12:02:27 -04:00
|
|
|
while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
|
2023-09-22 19:22:11 +02:00
|
|
|
WARN_ON(1);
|
2023-07-17 12:02:27 -04:00
|
|
|
zswap_duplicate_entry++;
|
2023-07-27 12:22:23 -04:00
|
|
|
zswap_invalidate_entry(tree, dupentry);
|
2023-07-17 12:02:27 -04:00
|
|
|
}
|
2023-06-12 11:38:13 +02:00
|
|
|
if (entry->length) {
|
2023-11-30 11:40:20 -08:00
|
|
|
INIT_LIST_HEAD(&entry->lru);
|
|
|
|
zswap_lru_add(&entry->pool->list_lru, entry);
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 11:40:23 -08:00
|
|
|
atomic_inc(&entry->pool->nr_stored);
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:38:09 +02:00
|
|
|
}
|
2013-07-10 16:05:03 -07:00
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
|
|
|
|
/* update stats */
|
|
|
|
atomic_inc(&zswap_stored_pages);
|
2015-09-09 15:35:19 -07:00
|
|
|
zswap_update_total_size();
|
2022-05-19 14:08:53 -07:00
|
|
|
count_vm_event(ZSWPOUT);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2023-07-17 12:02:27 -04:00
|
|
|
return true;
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2023-11-30 11:40:20 -08:00
|
|
|
put_pool:
|
2015-09-09 15:35:19 -07:00
|
|
|
zswap_pool_put(entry->pool);
|
|
|
|
freepage:
|
2013-07-10 16:05:03 -07:00
|
|
|
zswap_entry_cache_free(entry);
|
|
|
|
reject:
|
2022-05-19 14:08:53 -07:00
|
|
|
if (objcg)
|
|
|
|
obj_cgroup_put(objcg);
|
2023-07-17 12:02:27 -04:00
|
|
|
return false;
|
2022-05-19 14:08:53 -07:00
|
|
|
|
|
|
|
shrink:
|
2024-01-29 20:36:44 -05:00
|
|
|
shrink_pool = zswap_pool_last_get();
|
|
|
|
if (shrink_pool && !queue_work(shrink_wq, &shrink_pool->shrink_work))
|
|
|
|
zswap_pool_put(shrink_pool);
|
2022-05-19 14:08:53 -07:00
|
|
|
goto reject;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
|
|
|
|
2023-07-15 05:23:43 +01:00
|
|
|
bool zswap_load(struct folio *folio)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
2023-08-21 18:08:48 +02:00
|
|
|
swp_entry_t swp = folio->swap;
|
2023-07-17 12:02:27 -04:00
|
|
|
pgoff_t offset = swp_offset(swp);
|
2023-07-15 05:23:43 +01:00
|
|
|
struct page *page = &folio->page;
|
2024-01-19 11:22:23 +00:00
|
|
|
struct zswap_tree *tree = swap_zswap_tree(swp);
|
2013-07-10 16:05:03 -07:00
|
|
|
struct zswap_entry *entry;
|
2023-12-28 09:45:43 +00:00
|
|
|
u8 *dst;
|
2023-07-17 12:02:27 -04:00
|
|
|
|
2023-07-15 05:23:43 +01:00
|
|
|
VM_WARN_ON_ONCE(!folio_test_locked(folio));
|
2013-07-10 16:05:03 -07:00
|
|
|
|
|
|
|
spin_lock(&tree->lock);
|
2024-01-29 20:36:38 -05:00
|
|
|
entry = zswap_rb_search(&tree->rbroot, offset);
|
2013-07-10 16:05:03 -07:00
|
|
|
if (!entry) {
|
|
|
|
spin_unlock(&tree->lock);
|
2023-07-17 12:02:27 -04:00
|
|
|
return false;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
2024-01-29 20:36:38 -05:00
|
|
|
zswap_entry_get(entry);
|
2013-07-10 16:05:03 -07:00
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
|
2023-12-28 09:45:44 +00:00
|
|
|
if (entry->length)
|
2024-01-29 20:36:42 -05:00
|
|
|
zswap_decompress(entry, page);
|
2023-12-28 09:45:44 +00:00
|
|
|
else {
|
2023-11-27 16:55:21 +01:00
|
|
|
dst = kmap_local_page(page);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
zswap_fill_page(dst, entry->value);
|
2023-11-27 16:55:21 +01:00
|
|
|
kunmap_local(dst);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
}
|
|
|
|
|
2022-05-19 14:08:53 -07:00
|
|
|
count_vm_event(ZSWPIN);
|
2022-05-19 14:08:53 -07:00
|
|
|
if (entry->objcg)
|
|
|
|
count_objcg_event(entry->objcg, ZSWPIN);
|
2023-12-28 09:45:42 +00:00
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
spin_lock(&tree->lock);
|
2023-12-28 09:45:44 +00:00
|
|
|
if (zswap_exclusive_loads_enabled) {
|
2023-06-07 19:51:43 +00:00
|
|
|
zswap_invalidate_entry(tree, entry);
|
2023-07-15 05:23:43 +01:00
|
|
|
folio_mark_dirty(folio);
|
2023-06-12 11:38:13 +02:00
|
|
|
} else if (entry->length) {
|
2023-11-30 11:40:20 -08:00
|
|
|
zswap_lru_del(&entry->pool->list_lru, entry);
|
|
|
|
zswap_lru_add(&entry->pool->list_lru, entry);
|
2023-06-07 19:51:43 +00:00
|
|
|
}
|
2024-01-25 08:14:23 +00:00
|
|
|
zswap_entry_put(entry);
|
2013-07-10 16:05:03 -07:00
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
|
2023-12-28 09:45:44 +00:00
|
|
|
return true;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
|
|
|
|
2023-07-17 12:02:27 -04:00
|
|
|
void zswap_invalidate(int type, pgoff_t offset)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
2024-01-19 11:22:23 +00:00
|
|
|
struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
|
2013-07-10 16:05:03 -07:00
|
|
|
struct zswap_entry *entry;
|
|
|
|
|
|
|
|
spin_lock(&tree->lock);
|
|
|
|
entry = zswap_rb_search(&tree->rbroot, offset);
|
2024-01-29 20:36:45 -05:00
|
|
|
if (entry)
|
|
|
|
zswap_invalidate_entry(tree, entry);
|
2013-07-10 16:05:03 -07:00
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
}
|
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
int zswap_swapon(int type, unsigned long nr_pages)
|
2023-07-17 12:02:27 -04:00
|
|
|
{
|
2024-01-19 11:22:23 +00:00
|
|
|
struct zswap_tree *trees, *tree;
|
|
|
|
unsigned int nr, i;
|
2023-07-17 12:02:27 -04:00
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
|
|
|
|
trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
|
|
|
|
if (!trees) {
|
2023-07-17 12:02:27 -04:00
|
|
|
pr_err("alloc failed, zswap disabled for swap type %d\n", type);
|
mm/zswap: make sure each swapfile always have zswap rb-tree
Patch series "mm/zswap: optimize the scalability of zswap rb-tree", v2.
When testing the zswap performance by using kernel build -j32 in a tmpfs
directory, I found the scalability of zswap rb-tree is not good, which is
protected by the only spinlock. That would cause heavy lock contention if
multiple tasks zswap_store/load concurrently.
So a simple solution is to split the only one zswap rb-tree into multiple
rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
is from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks").
Although this method can't solve the spinlock contention completely, it
can mitigate much of that contention. Below is the results of kernel
build in tmpfs with zswap shrinker enabled:
linux-next zswap-lock-optimize
real 1m9.181s 1m3.820s
user 17m44.036s 17m40.100s
sys 7m37.297s 4m54.622s
So there are clearly improvements. And it's complementary with the
ongoing zswap xarray conversion by Chris. Anyway, I think we can also
merge this first, it's complementary IMHO. So I just refresh and resend
this for further discussion.
This patch (of 2):
Not all zswap interfaces can handle the absence of the zswap rb-tree,
actually only zswap_store() has handled it for now.
To make things simple, we make sure each swapfile always have the zswap
rb-tree prepared before being enabled and used. The preparation is
unlikely to fail in practice, this patch just make it explicit.
Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-0-b5cc55479090@bytedance.com
Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-1-b5cc55479090@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Chris Li <chriscli@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-19 11:22:22 +00:00
|
|
|
return -ENOMEM;
|
2023-07-17 12:02:27 -04:00
|
|
|
}
|
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
for (i = 0; i < nr; i++) {
|
|
|
|
tree = trees + i;
|
|
|
|
tree->rbroot = RB_ROOT;
|
|
|
|
spin_lock_init(&tree->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
nr_zswap_trees[type] = nr;
|
|
|
|
zswap_trees[type] = trees;
|
mm/zswap: make sure each swapfile always have zswap rb-tree
Patch series "mm/zswap: optimize the scalability of zswap rb-tree", v2.
When testing the zswap performance by using kernel build -j32 in a tmpfs
directory, I found the scalability of zswap rb-tree is not good, which is
protected by the only spinlock. That would cause heavy lock contention if
multiple tasks zswap_store/load concurrently.
So a simple solution is to split the only one zswap rb-tree into multiple
rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
is from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks").
Although this method can't solve the spinlock contention completely, it
can mitigate much of that contention. Below is the results of kernel
build in tmpfs with zswap shrinker enabled:
linux-next zswap-lock-optimize
real 1m9.181s 1m3.820s
user 17m44.036s 17m40.100s
sys 7m37.297s 4m54.622s
So there are clearly improvements. And it's complementary with the
ongoing zswap xarray conversion by Chris. Anyway, I think we can also
merge this first, it's complementary IMHO. So I just refresh and resend
this for further discussion.
This patch (of 2):
Not all zswap interfaces can handle the absence of the zswap rb-tree,
actually only zswap_store() has handled it for now.
To make things simple, we make sure each swapfile always have the zswap
rb-tree prepared before being enabled and used. The preparation is
unlikely to fail in practice, this patch just make it explicit.
Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-0-b5cc55479090@bytedance.com
Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-1-b5cc55479090@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Chris Li <chriscli@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-19 11:22:22 +00:00
|
|
|
return 0;
|
2023-07-17 12:02:27 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
void zswap_swapoff(int type)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
2024-01-19 11:22:23 +00:00
|
|
|
struct zswap_tree *trees = zswap_trees[type];
|
|
|
|
unsigned int i;
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
if (!trees)
|
2013-07-10 16:05:03 -07:00
|
|
|
return;
|
|
|
|
|
2024-01-24 04:51:12 +00:00
|
|
|
/* try_to_unuse() invalidated all the entries already */
|
|
|
|
for (i = 0; i < nr_zswap_trees[type]; i++)
|
|
|
|
WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
|
2024-01-19 11:22:23 +00:00
|
|
|
|
|
|
|
kvfree(trees);
|
|
|
|
nr_zswap_trees[type] = 0;
|
2013-10-16 13:46:54 -07:00
|
|
|
zswap_trees[type] = NULL;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*********************************
|
|
|
|
* debugfs functions
|
|
|
|
**********************************/
|
|
|
|
#ifdef CONFIG_DEBUG_FS
|
|
|
|
#include <linux/debugfs.h>
|
|
|
|
|
|
|
|
static struct dentry *zswap_debugfs_root;
|
|
|
|
|
2023-04-03 20:13:18 +08:00
|
|
|
static int zswap_debugfs_init(void)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
|
|
|
if (!debugfs_initialized())
|
|
|
|
return -ENODEV;
|
|
|
|
|
|
|
|
zswap_debugfs_root = debugfs_create_dir("zswap", NULL);
|
|
|
|
|
2018-06-14 15:27:58 -07:00
|
|
|
debugfs_create_u64("pool_limit_hit", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_pool_limit_hit);
|
|
|
|
debugfs_create_u64("reject_reclaim_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_reclaim_fail);
|
|
|
|
debugfs_create_u64("reject_alloc_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_alloc_fail);
|
|
|
|
debugfs_create_u64("reject_kmemcache_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_kmemcache_fail);
|
2023-10-24 16:45:09 -07:00
|
|
|
debugfs_create_u64("reject_compress_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_compress_fail);
|
2018-06-14 15:27:58 -07:00
|
|
|
debugfs_create_u64("reject_compress_poor", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_compress_poor);
|
|
|
|
debugfs_create_u64("written_back_pages", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_written_back_pages);
|
|
|
|
debugfs_create_u64("duplicate_entry", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_duplicate_entry);
|
|
|
|
debugfs_create_u64("pool_total_size", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_pool_total_size);
|
|
|
|
debugfs_create_atomic_t("stored_pages", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_stored_pages);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap 78000(~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000(~17%) are zero-filled pages. So
an average of %5 of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 16:15:59 -08:00
|
|
|
debugfs_create_atomic_t("same_filled_pages", 0444,
|
2018-06-14 15:27:58 -07:00
|
|
|
zswap_debugfs_root, &zswap_same_filled_pages);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#else
|
2023-04-03 20:13:18 +08:00
|
|
|
static int zswap_debugfs_init(void)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*********************************
|
|
|
|
* module init and exit
|
|
|
|
**********************************/
|
2023-04-03 20:13:18 +08:00
|
|
|
static int zswap_setup(void)
|
2013-07-10 16:05:03 -07:00
|
|
|
{
|
2015-09-09 15:35:19 -07:00
|
|
|
struct zswap_pool *pool;
|
2016-11-27 00:13:39 +01:00
|
|
|
int ret;
|
2014-04-07 15:38:27 -07:00
|
|
|
|
2023-04-03 20:13:16 +08:00
|
|
|
zswap_entry_cache = KMEM_CACHE(zswap_entry, 0);
|
|
|
|
if (!zswap_entry_cache) {
|
2013-07-10 16:05:03 -07:00
|
|
|
pr_err("entry cache creation failed\n");
|
2015-09-09 15:35:19 -07:00
|
|
|
goto cache_fail;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
2015-09-09 15:35:19 -07:00
|
|
|
|
2016-11-27 00:13:40 +01:00
|
|
|
ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
|
|
|
|
"mm/zswap_pool:prepare",
|
|
|
|
zswap_cpu_comp_prepare,
|
|
|
|
zswap_cpu_comp_dead);
|
|
|
|
if (ret)
|
|
|
|
goto hp_fail;
|
|
|
|
|
2015-09-09 15:35:19 -07:00
|
|
|
pool = __zswap_pool_create_fallback();
|
2017-02-27 14:26:47 -08:00
|
|
|
if (pool) {
|
|
|
|
pr_info("loaded using pool %s/%s\n", pool->tfm_name,
|
2023-06-20 19:46:44 +00:00
|
|
|
zpool_get_type(pool->zpools[0]));
|
2017-02-27 14:26:47 -08:00
|
|
|
list_add(&pool->list, &zswap_pools);
|
|
|
|
zswap_has_pool = true;
|
|
|
|
} else {
|
2015-09-09 15:35:19 -07:00
|
|
|
pr_err("pool creation failed\n");
|
2017-02-27 14:26:47 -08:00
|
|
|
zswap_enabled = false;
|
2013-07-10 16:05:03 -07:00
|
|
|
}
|
2014-04-07 15:38:27 -07:00
|
|
|
|
2024-01-16 23:31:45 +10:00
|
|
|
shrink_wq = alloc_workqueue("zswap-shrink",
|
|
|
|
WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
|
2020-01-30 22:15:04 -08:00
|
|
|
if (!shrink_wq)
|
|
|
|
goto fallback_fail;
|
|
|
|
|
2013-07-10 16:05:03 -07:00
|
|
|
if (zswap_debugfs_init())
|
|
|
|
pr_warn("debugfs initialization failed\n");
|
2023-04-03 20:13:17 +08:00
|
|
|
zswap_init_state = ZSWAP_INIT_SUCCEED;
|
2013-07-10 16:05:03 -07:00
|
|
|
return 0;
|
2015-09-09 15:35:19 -07:00
|
|
|
|
2020-01-30 22:15:04 -08:00
|
|
|
fallback_fail:
|
2020-01-30 22:15:07 -08:00
|
|
|
if (pool)
|
|
|
|
zswap_pool_destroy(pool);
|
2016-11-27 00:13:40 +01:00
|
|
|
hp_fail:
|
2023-04-03 20:13:16 +08:00
|
|
|
kmem_cache_destroy(zswap_entry_cache);
|
2015-09-09 15:35:19 -07:00
|
|
|
cache_fail:
|
zswap: disable changing params if init fails
Add zswap_init_failed bool that prevents changing any of the module
params, if init_zswap() fails, and set zswap_enabled to false. Change
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to 'enabled', 'zpool', or 'compressor' params.
Any driver that is built-in to the kernel will not be unloaded if its
init function returns error, and its module params remain accessible for
users to change via sysfs. Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in WARNING due to the param
callbacks expecting a pool to already exist. This prevents that by
immediately exiting any of the param callbacks if initialization failed.
This was reported here:
https://marc.info/?l=linux-mm&m=147004228125528&w=4
And fixes this WARNING:
[ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60
The warning is just noise, and not serious. However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache. The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as
temporary buffer for compressed pages before copying into place in the
zpool.
If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).
Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 13:13:09 -08:00
|
|
|
/* if built-in, we aren't unloaded on failure; don't allow use */
|
2023-04-03 20:13:17 +08:00
|
|
|
zswap_init_state = ZSWAP_INIT_FAILED;
|
zswap: disable changing params if init fails
Add zswap_init_failed bool that prevents changing any of the module
params, if init_zswap() fails, and set zswap_enabled to false. Change
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to 'enabled', 'zpool', or 'compressor' params.
Any driver that is built-in to the kernel will not be unloaded if its
init function returns error, and its module params remain accessible for
users to change via sysfs. Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in WARNING due to the param
callbacks expecting a pool to already exist. This prevents that by
immediately exiting any of the param callbacks if initialization failed.
This was reported here:
https://marc.info/?l=linux-mm&m=147004228125528&w=4
And fixes this WARNING:
[ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60
The warning is just noise, and not serious. However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache. The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as
temporary buffer for compressed pages before copying into place in the
zpool.
If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).
Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 13:13:09 -08:00
|
|
|
zswap_enabled = false;
|
2013-07-10 16:05:03 -07:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2023-04-03 20:13:18 +08:00
|
|
|
|
|
|
|
static int __init zswap_init(void)
|
|
|
|
{
|
|
|
|
if (!zswap_enabled)
|
|
|
|
return 0;
|
|
|
|
return zswap_setup();
|
|
|
|
}
|
2013-07-10 16:05:03 -07:00
|
|
|
/* must be late so crypto has time to come up */
|
2023-04-03 20:13:18 +08:00
|
|
|
late_initcall(zswap_init);
|
2013-07-10 16:05:03 -07:00
|
|
|
|
2014-11-12 21:08:46 -06:00
|
|
|
MODULE_AUTHOR("Seth Jennings <sjennings@variantweb.net>");
|
2013-07-10 16:05:03 -07:00
|
|
|
MODULE_DESCRIPTION("Compressed cache for swap pages");
|