linux-stable/Documentation
Shakeel Butt 94968384dd memcg: introduce per-memcg reclaim interface
This patch series adds a memory.reclaim proactive reclaim interface.
The rationale behind the interface and how it works are in the first
patch.


This patch (of 4):

Introduce a memcg interface to trigger memory reclaim on a memory cgroup.

Use case: Proactive Reclaim
---------------------------

A userspace proactive reclaimer can continuously probe the memcg to
reclaim a small amount of memory.  This gives more accurate and up-to-date
workingset estimation as the LRUs are continuously sorted and can
potentially provide more deterministic memory overcommit behavior.  The
memory overcommit controller can provide more proactive response to the
changing behavior of the running applications instead of being reactive.

A userspace reclaimer's purpose in this case is not a complete replacement
for kswapd or direct reclaim, it is to proactively identify memory savings
opportunities and reclaim some amount of cold pages set by the policy to
free up the memory for more demanding jobs or scheduling new jobs.

A user space proactive reclaimer is used in Google data centers. 
Additionally, Meta's TMO paper recently referenced a very similar
interface used for user space proactive reclaim:
https://dl.acm.org/doi/pdf/10.1145/3503222.3507731

Benefits of a user space reclaimer:
-----------------------------------

1) More flexible on who should be charged for the cpu of the memory
   reclaim.  For proactive reclaim, it makes more sense to be centralized.

2) More flexible on dedicating the resources (like cpu).  The memory
   overcommit controller can balance the cost between the cpu usage and
   the memory reclaimed.

3) Provides a way to the applications to keep their LRUs sorted, so,
   under memory pressure better reclaim candidates are selected.  This
   also gives more accurate and uptodate notion of working set for an
   application.

Why memory.high is not enough?
------------------------------

- memory.high can be used to trigger reclaim in a memcg and can
  potentially be used for proactive reclaim.  However there is a big
  downside in using memory.high.  It can potentially introduce high
  reclaim stalls in the target application as the allocations from the
  processes or the threads of the application can hit the temporary
  memory.high limit.

- Userspace proactive reclaimers usually use feedback loops to decide
  how much memory to proactively reclaim from a workload.  The metrics
  used for this are usually either refaults or PSI, and these metrics will
  become messy if the application gets throttled by hitting the high
  limit.

- memory.high is a stateful interface, if the userspace proactive
  reclaimer crashes for any reason while triggering reclaim it can leave
  the application in a bad state.

- If a workload is rapidly expanding, setting memory.high to proactively
  reclaim memory can result in actually reclaiming more memory than
  intended.

The benefits of such interface and shortcomings of existing interface were
further discussed in this RFC thread:
https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/

Interface:
----------

Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
trigger reclaim in the target memory cgroup.

The interface is introduced as a nested-keyed file to allow for future
optional arguments to be easily added to configure the behavior of
reclaim.

Possible Extensions:
--------------------

- This interface can be extended with an additional parameter or flags
  to allow specifying one or more types of memory to reclaim from (e.g.
  file, anon, ..).

- The interface can also be extended with a node mask to reclaim from
  specific nodes. This has use cases for reclaim-based demotion in memory
  tiering systens.

- A similar per-node interface can also be added to support proactive
  reclaim and reclaim-based demotion in systems without memcg.

- Add a timeout parameter to make it easier for user space to call the
  interface without worrying about being blocked for an undefined amount
  of time.

For now, let's keep things simple by adding the basic functionality.

[yosryahmed@google.com: worked on versions v2 onwards, refreshed to
current master, updated commit message based on recent
discussions and use cases]
Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Co-developed-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Wei Xu <weixugc@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Chen Wandun <chenwandun@huawei.com>
Cc: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29 14:36:59 -07:00
..
ABI Changes since last update: 2022-04-20 12:35:20 -07:00
accounting - A bunch of fixes: forced idle time accounting, utilization values 2022-01-23 17:35:27 +02:00
admin-guide memcg: introduce per-memcg reclaim interface 2022-04-29 14:36:59 -07:00
arc
arm Documentation: arm: marvell: Extend Avanta list 2022-01-27 11:22:34 -07:00
arm64 Merge branch 'for-next/mte' into for-next/core 2022-03-14 19:01:23 +00:00
block block: remove biodoc.rst 2022-02-15 07:47:52 -07:00
bpf docs: netdev: move the netdev-FAQ to the process pages 2022-03-31 10:49:39 +02:00
cdrom Documentation: Fix links for udftools project and pktcdvd tool 2022-02-15 16:15:33 -07:00
core-api XArray update for 5.18: 2022-04-01 13:40:44 -07:00
cpu-freq cpufreq: Reintroduce ready() callback 2022-02-09 13:18:49 +05:30
crypto
dev-tools Documentation: kunit: fix path to .kunitconfig in start.rst 2022-04-04 12:02:44 -06:00
devicetree Networking fixes for 5.18-rc5, including fixes from bluetooth, bpf 2022-04-28 12:34:50 -07:00
doc-guide docs: discourage use of list tables 2022-01-07 09:33:13 -07:00
driver-api Merge drm-misc/drm-misc-next-fixes into drm-misc-fixes 2022-04-05 11:39:22 +02:00
fault-injection
fb
features nds32: Remove the architecture 2022-03-07 13:54:59 +01:00
filesystems f2fs-fix-5.18 2022-04-25 10:53:56 -07:00
firmware_class
firmware-guide Merge branch 'i2c/for-mergewindow' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2022-03-26 12:46:08 -07:00
fpga
gpu pci-v5.18-changes 2022-03-25 13:02:05 -07:00
hid
hwmon Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
i2c i2c: i801: Add support for Intel Raptor Lake PCH-S 2022-02-15 10:03:40 +01:00
ia64
ide
iio
infiniband
input Input: docs: add more details on the use of BTN_TOOL 2022-03-01 15:46:03 +01:00
isdn
kbuild kbuild: Make $(LLVM) more flexible 2022-03-31 12:03:46 +09:00
kernel-hacking docs: fix typo in Documentation/kernel-hacking/locking.rst 2022-01-27 11:22:33 -07:00
leds
litmus-tests
livepatch
locking Documentation: Fix duplicate statement about raw_spinlock_t type 2022-03-25 13:30:08 -06:00
m68k
maintainer Some late-arriving documentation improvements. This is mostly build-system 2022-03-31 12:10:42 -07:00
mhi
mips
misc-devices
netlabel
networking doc/ip-sysctl: add bc_forwarding 2022-04-20 10:31:43 +01:00
nios2
nvdimm
openrisc
parisc
PCI PCI/doc: cleanup references to the legacy PCI DMA API 2022-03-30 16:54:24 +02:00
pcmcia
peci docs: Add PECI documentation 2022-02-09 08:04:44 +01:00
power Documentation: EM: Describe new registration method using DT 2022-03-03 09:35:04 +05:30
powerpc
process docs: netdev: move the netdev-FAQ to the process pages 2022-03-31 10:49:39 +02:00
RCU
riscv Documentation: riscv: remove non-existent directory from table of contents 2022-03-31 16:18:56 -07:00
s390
scheduler Changes in this cycle were: 2022-03-22 14:39:12 -07:00
scsi scsi: ufs: docs: UFS documentation corrections 2022-03-08 22:49:49 -05:00
security selinux/stable-5.18 PR 20220321 2022-03-21 20:47:54 -07:00
sh
sound ALSA: hda/realtek: Add alc256-samsung-headphone fixup 2022-03-22 21:51:02 +01:00
sparc
sphinx docs: sphinx/requirements: Limit jinja2<3.1 2022-03-30 13:44:54 -06:00
sphinx-static
spi spi: pxa2xx_spi: Convert to use GPIO descriptors 2022-01-31 15:17:27 +00:00
staging remoteproc: Change rproc_shutdown() to return a status 2022-03-11 14:31:55 -06:00
target
timers
tools Real Time Analysis Tool updates for 5.18 2022-03-23 11:08:10 -07:00
trace Updates to Tracing: 2022-04-03 12:26:01 -07:00
translations Kbuild -std=gnu11 updates for v5.18 2022-03-25 11:48:01 -07:00
tty
usb usb: gadget: f_uac2: Optionally determine bInterval for HS and SS 2022-01-31 14:26:18 +01:00
userspace-api platform-drivers-x86 for v5.18-1 2022-03-25 12:14:39 -07:00
virt Documentation: KVM: Add SPDX-License-Identifier tag 2022-04-11 13:28:56 -04:00
vm mm/sparse-vmemmap: improve memory savings for compound devmaps 2022-04-28 23:16:16 -07:00
w1
watchdog
x86 - More noinstr fixes 2022-03-25 12:34:53 -07:00
xtensa
.gitignore
arch.rst
asm-annotations.rst linkage: remove SYM_FUNC_{START,END}_ALIAS() 2022-02-22 16:21:34 +00:00
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py docs: pdfdocs: Pull LaTeX preamble part out of conf.py 2022-02-24 12:26:13 -07:00
COPYING-logo
docutils.conf
dontdiff
index.rst docs: Add PECI documentation 2022-02-09 08:04:44 +01:00
Kconfig
logo.gif
Makefile docs: Makefile: Add -no-shell-escape option to LATEXOPTS 2022-02-14 12:50:17 -07:00
memory-barriers.txt
SubmittingPatches
watch_queue.rst