mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
e31c38e037
Patch series "improving dynamic zswap shrinker protection scheme", v3.

When experimenting with the memory-pressure based (i.e. "dynamic") zswap shrinker in production, we observed a sharp increase in the number of swapins, which led to a performance regression. We were able to trace this regression to the following problems with the shrinker's warm pages protection scheme:

1. The protection decays far too rapidly, and the decaying is coupled with zswap stores, leading to anomalous patterns in which a small batch of zswap stores effectively erases all the protection in place for the warmer pages in the zswap LRU. This observation has also been corroborated upstream by Takero Funaki (in [1]).

2. We inaccurately track the number of swapped in pages, missing the non-pivot pages that are part of the readahead window, while counting the pages that are found in the zswap pool.

To alleviate these two issues, this patch series improves the dynamic zswap shrinker in the following manner:

1. Replace the protection size tracking scheme with a second chance algorithm. This new scheme removes the need for haphazard stats decaying, and automatically adjusts the pace of page aging with memory pressure, and the writeback rate with pool activity: slowing down when the pool is dominated by zswpouts, and speeding up when the pool is dominated by stale entries.

2. Fix the tracking of the number of swapins to take into account non-pivot pages in the readahead window.

With these two changes in place, in a kernel-building benchmark without any cold data added, the number of swapins is reduced by 64.12%. This translates to a 10.32% reduction in build time. We also observe a 3% reduction in kernel CPU time. In another benchmark, with cold data added (to gauge the new algorithm's ability to offload cold data), the new second chance scheme outperforms the old protection scheme by around 0.7%, and writes back around 21% more pages to the backing swap device. So the new scheme is at least as good as, if not better than, the old scheme on this front as well.

[1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/

This patch (of 2):

The current zswap shrinker's heuristics to prevent overshrinking are brittle and inaccurate, specifically in the way we decay the protection size (i.e. making pages in the zswap LRU eligible for reclaim).

We currently decay the protection aggressively in zswap_lru_add() calls. This leads to the following unfortunate effect: when a new batch of pages enters zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low.

We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is the opposite of what we originally want: when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU.

Replace the existing heuristics with a second chance algorithm:

1. When a new zswap entry is stored in the zswap pool, its referenced bit is set.

2. When the zswap shrinker encounters a zswap entry with the referenced bit set, give it a second chance: only clear the referenced bit and rotate the entry in the LRU.

3. If the shrinker encounters the entry again, this time with its referenced bit unset, it can reclaim the entry.
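For illustration, here is a minimal, self-contained C sketch of the second chance decision described in steps 1-3 above. The names used here (struct entry, scan_entry(), the ROTATE/RECLAIM results and the main() driver) are simplified stand-ins invented for this write-up, not the actual mm/zswap.c code.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a zswap LRU entry; not the real struct zswap_entry. */
struct entry {
        bool referenced;        /* set when the entry is stored into the pool (step 1) */
};

enum scan_result {
        ROTATE,         /* second chance: keep the entry, move it to the LRU tail */
        RECLAIM,        /* write the entry back to the swap device and free it */
};

/*
 * Decision made for the entry the shrinker finds at the tail of the LRU:
 * a set referenced bit buys the entry one more trip through the list;
 * an entry seen with the bit already clear is reclaimed.
 */
static enum scan_result scan_entry(struct entry *e)
{
        if (e->referenced) {
                e->referenced = false; /* step 2: clear the bit and rotate */
                return ROTATE;
        }
        return RECLAIM; /* step 3: second encounter, reclaim */
}

int main(void)
{
        struct entry e = { .referenced = true }; /* step 1: set at store time */

        printf("first scan:  %s\n", scan_entry(&e) == ROTATE ? "rotate" : "reclaim");
        printf("second scan: %s\n", scan_entry(&e) == ROTATE ? "rotate" : "reclaim");
        return 0;
}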
In this manner, the aging of the pages in the zswap LRU is decoupled from zswap stores, and picks up the pace with increasing memory pressure (which is what we want).

The second chance scheme allows us to modulate the writeback rate based on recent pool activity. Entries that recently entered the pool will be protected, so if the pool is dominated by such entries the writeback rate will drop proportionally, protecting the workload's workingset. On the other hand, stale entries will be written back quickly, which increases the effective writeback rate.

The referenced bit is added in the hole after the `length` field of struct zswap_entry, so there is no extra space overhead for this algorithm.

We will still maintain the count of swapins, which is consumed and subtracted from the lru size in zswap_shrinker_count(), to further penalize past overshrinking that led to disk swapins. The idea is that had we considered this many more pages in the LRU active/protected, they would not have been written back and we would not have had to swap them in.

To test the new heuristic, I built the kernel under a cgroup with memory.max set to 2G, on a host with 36 cores:

With the old shrinker:

real: 263.89s
user: 4318.11s
sys: 673.29s
swapins: 227300.5

With the second chance algorithm:

real: 244.85s
user: 4327.22s
sys: 664.39s
swapins: 94663

(average over 5 runs)

We observe a 1.3% reduction in kernel CPU usage, and around a 7.2% reduction in real time. Note that the number of swapped in pages dropped by 58%.

[nphamcs@gmail.com: fix a small mistake in the referenced bit documentation]
Link: https://lkml.kernel.org/r/20240806003403.3142387-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20240805232243.2896283-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20240805232243.2896283-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Takero Funaki <flintglass@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
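To make the changelog's "no extra space overhead" point about the referenced bit concrete, below is a small standalone C illustration of a padding hole being reused. The field names and the neighboring pointer member are assumptions chosen to reproduce the situation described above, not the actual struct zswap_entry layout.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical layouts for illustration; not the real struct zswap_entry. */
struct without_bit {
        unsigned int length;    /* 4 bytes, then 4 bytes of padding ... */
        void *handle;           /* ... because this pointer is 8-byte aligned on 64-bit */
};

struct with_bit {
        unsigned int length;
        bool referenced;        /* lands in the padding hole after length */
        void *handle;
};

int main(void)
{
        /* On a typical 64-bit ABI both print 16: the new bit costs no space. */
        printf("%zu %zu\n", sizeof(struct without_bit), sizeof(struct with_bit));
        return 0;
}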
75 lines
1.8 KiB
C
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_ZSWAP_H
#define _LINUX_ZSWAP_H

#include <linux/types.h>
#include <linux/mm_types.h>

struct lruvec;

extern atomic_t zswap_stored_pages;

#ifdef CONFIG_ZSWAP

struct zswap_lruvec_state {
        /*
         * Number of swapped in pages from disk, i.e. not found in the zswap pool.
         *
         * This is consumed and subtracted from the lru size in
         * zswap_shrinker_count() to penalize past overshrinking that led to disk
         * swapins. The idea is that had we considered this many more pages in the
         * LRU active/protected and not written them back, we would not have had to
         * swap them in.
         */
        atomic_long_t nr_disk_swapins;
};

unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
bool zswap_load(struct folio *folio);
void zswap_invalidate(swp_entry_t swp);
int zswap_swapon(int type, unsigned long nr_pages);
void zswap_swapoff(int type);
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
bool zswap_is_enabled(void);
bool zswap_never_enabled(void);
#else

struct zswap_lruvec_state {};

static inline bool zswap_store(struct folio *folio)
{
        return false;
}

static inline bool zswap_load(struct folio *folio)
{
        return false;
}

static inline void zswap_invalidate(swp_entry_t swp) {}
static inline int zswap_swapon(int type, unsigned long nr_pages)
{
        return 0;
}
static inline void zswap_swapoff(int type) {}
static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
static inline void zswap_folio_swapin(struct folio *folio) {}

static inline bool zswap_is_enabled(void)
{
        return false;
}

static inline bool zswap_never_enabled(void)
{
        return true;
}

#endif

#endif /* _LINUX_ZSWAP_H */
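To close, a simplified sketch of the arithmetic behind the nr_disk_swapins comment above, assuming a stripped-down reading of zswap_shrinker_count(): the helper name and the plain integer parameters are invented for this write-up; in the kernel the counter is an atomic that is consumed (reset) as it is applied.

#include <stdio.h>

/*
 * Illustrative only: report the LRU size minus the recent disk swapins, so
 * past overshrinking (which forced pages to be read back from disk)
 * temporarily suppresses further writeback.
 */
static unsigned long reclaimable_after_penalty(unsigned long lru_size,
                                               unsigned long nr_disk_swapins)
{
        if (nr_disk_swapins >= lru_size)
                return 0;       /* the whole LRU is treated as protected */
        return lru_size - nr_disk_swapins;
}

int main(void)
{
        /* 1000 pages on the LRU, 300 recent disk swapins -> 700 reported. */
        printf("%lu\n", reclaimable_after_penalty(1000, 300));
        return 0;
}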