linux-stable/mm
Joshua Hahn 05d4532b60 memcg/hugetlb: add hugeTLB counters to memcg
This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, only if hugeTLB accounting is done to memory.current.  This feature
is enabled the same way hugeTLB accounting is enabled, via the
memory_hugetlb_accounting mount flag for cgroupsv2.

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current.  However, the metric is not reported in
memory.stat.  Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing.  This patch solves this problem by including the
metric in memory.stat as well, but only if it is also reported in
memory.current (it would also be confusing if the value was reported in
memory.stat, but not in memory.current)

Aside from the consistency between the two files, we also see benefits in
observability.  Userspace might be interested in the hugeTLB footprint of
cgroups for many reasons.  For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks: i.e. 
memory-intensive tasks are using more hugeTLB pages than tasks that don't
consume a lot of memory, or are seen to fault frequently.  Note that this
is separate from wanting to inspect the distribution for limiting purposes
(in which case, hugeTLB controller makes more sense).

2. We already have a hugeTLB controller. Why not use that?
It is true that hugeTLB tracks the exact value that we want.  In fact, by
enabling the hugeTLB controller, we get all of the observability benefits
that I mentioned above, and users can check the total hugeTLB usage,
verify if it is distributed as expected, etc.

With this said, there are 2 problems:
(a) They are still not reported in memory.stat, which means the
    disparity between the memcg reports are still there.
(b) We cannot reasonably expect users to enable the hugeTLB controller
    just for the sake of hugeTLB usage reporting, especially since
    they don't have any use for hugeTLB usage enforcing [2].

3. Implementation Details:
In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb.  mem_cgroup_commit_charge
which is called from alloc_hugetlb_folio will set memcg for the folio only
if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so
lruvec_stat_mod_folio accounts per-memcg hugetlb counters only if the
feature is enabled.  Regardless of whether memcg accounts for hugetlb, the
newly added global counter is updated and shown in /proc/vmstat.

The global counter is added because vmstats is the preferred framework for
cgroup stats.  It makes stat items consistent between global and cgroups. 
It also provides a per-node breakdown, which is useful.  Because it does
not use cgroup-specific hooks, we also keep generic MM code separate from
memcg code.

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] Of course, we can't make a new patch for every feature that can be
    duplicated. However, since the existing solution of enabling the
    hugeTLB controller is an imperfect solution that still leaves a
    discrepancy between memory.stat and memory.curent, I think that it
    is reasonable to isolate the feature in this case.

Link: https://lkml.kernel.org/r/20241101204402.1885383-1-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14 22:49:19 -08:00
..
damon Merge branch 'mm-hotfixes-stable' into mm-stable 2024-11-11 00:04:10 -08:00
kasan mm: define general function pXd_init() 2024-11-11 17:22:27 -08:00
kfence mm: kfence: fix elapsed time for allocated/freed track 2024-09-26 14:01:44 -07:00
kmsan mm, kasan, kmsan: instrument copy_from/to_kernel_nofault 2024-11-06 20:11:14 -08:00
backing-dev.c writeback: support retrieving per group debug writeback stats of bdi 2024-05-05 17:53:51 -07:00
balloon_compaction.c mm: remove MIGRATE_SYNC_NO_COPY mode 2024-07-03 19:30:00 -07:00
bootmem_info.c bootmem: stop using page->index 2024-11-07 14:38:07 -08:00
cma_debug.c
cma_sysfs.c mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
cma.c mm/cma: fix useless return in void function 2024-11-05 16:56:30 -08:00
cma.h mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
compaction.c mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath() 2024-09-03 21:15:47 -07:00
debug_page_alloc.c mm: page_alloc: consolidate free page accounting 2024-04-25 20:56:04 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries 2024-09-17 01:07:01 -07:00
debug.c mm: support only one page_type per page 2024-09-03 21:15:43 -07:00
dmapool_test.c mm/dmapool: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
dmapool.c mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs 2023-12-05 11:17:58 +01:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
execmem.c alloc_tag: populate memory for module tags as needed 2024-11-07 14:25:16 -08:00
fadvise.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
fail_page_alloc.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
failslab.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
filemap.c memcg-v1: remove memcg move locking code 2024-11-06 20:11:19 -08:00
folio-compat.c mm: remove putback_lru_page() 2024-09-09 16:38:59 -07:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h
gup.c gup: convert FOLL_TOUCH case in follow_page_pte() to folio 2024-11-06 20:11:08 -08:00
highmem.c mm/highmem: make nr_free_highpages() return "unsigned long" 2024-07-03 19:30:06 -07:00
hmm.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
huge_memory.c mm: huge_memory: use strscpy() instead of strcpy() 2024-11-11 13:09:43 -08:00
hugetlb_cgroup.c mm: memcg: don't call propagate_protected_usage() needlessly 2024-09-01 20:25:50 -07:00
hugetlb_vmemmap.c mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO 2024-09-01 20:25:45 -07:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: fix reference to nonexistent file 2023-10-25 16:47:14 -07:00
hugetlb.c memcg/hugetlb: add hugeTLB counters to memcg 2024-11-14 22:49:19 -08:00
hwpoison-inject.c mm/hwpoison: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
init-mm.c mm: Deprecate pasid field 2023-12-12 10:11:32 +01:00
internal.h mm: move `get_order_from_str()` to internal.h 2024-11-11 13:09:43 -08:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed 2023-08-18 10:12:36 -07:00
Kconfig resource: remove dependency on SPARSEMEM from GET_FREE_REGION 2024-10-28 21:40:39 -07:00
Kconfig.debug slub: Introduce CONFIG_SLUB_RCU_DEBUG 2024-08-27 14:12:51 +02:00
khugepaged.c mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock() 2024-11-05 16:56:27 -08:00
kmemleak.c kmemleak: iommu/iova: fix transient kmemleak false positive 2024-11-11 17:22:26 -08:00
ksm.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
list_lru.c mm/list_lru: simplify the list_lru walk callback function 2024-11-11 17:22:26 -08:00
maccess.c kasan: migrate copy_user_test to kunit 2024-11-11 00:26:44 -08:00
madvise.c mm: madvise: implement lightweight guard page mechanism 2024-11-11 00:26:45 -08:00
Makefile mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
mapping_dirty_helpers.c mm: fix clean_record_shared_mapping_range kernel-doc 2023-08-24 16:20:30 -07:00
memblock.c memblock: updates for 6.12-rc1 2024-09-25 11:35:19 -07:00
memcontrol-v1.c memcg-v1: remove memcg move locking code 2024-11-06 20:11:19 -08:00
memcontrol-v1.h memcg-v1: remove charge move code 2024-11-06 20:11:18 -08:00
memcontrol.c memcg/hugetlb: add hugeTLB counters to memcg 2024-11-14 22:49:19 -08:00
memfd.c mm/hugetlb: simplify refs in memfd_alloc_folio 2024-09-26 14:01:44 -07:00
memory_hotplug.c kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_end 2024-11-06 20:11:11 -08:00
memory-failure.c mm/memory-failure: replace sprintf() with sysfs_emit() 2024-11-11 00:26:46 -08:00
memory-tiers.c memory tiers: use default_dram_perf_ref_source in log message 2024-09-26 14:01:44 -07:00
memory.c mm: add PTE_MARKER_GUARD PTE marker 2024-11-11 00:26:44 -08:00
mempolicy.c mm: renovate page_address_in_vma() 2024-11-07 14:38:07 -08:00
mempool.c mm: fix xyz_noprof functions calling profiled functions 2024-06-05 19:19:26 -07:00
memremap.c mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs() 2024-05-05 17:53:49 -07:00
memtest.c memtest: use {READ,WRITE}_ONCE in memory scanning 2024-03-13 12:12:21 -07:00
migrate_device.c mm: remap unused subpages to shared zeropage when splitting isolated thp 2024-09-09 16:39:03 -07:00
migrate.c mm: remove redundant condition for THP folio 2024-11-06 20:11:16 -08:00
mincore.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
mlock.c mm/mlock: set the correct prev on failure 2024-11-07 14:14:58 -08:00
mm_init.c alloc_tag: support for page allocation tag compression 2024-11-07 14:25:16 -08:00
mm_slot.h
mmap_lock.c mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount 2024-11-11 17:22:28 -08:00
mmap.c mm: remove unnecessary page_table_lock on stack expansion 2024-11-11 13:09:43 -08:00
mmu_gather.c mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing 2024-02-22 15:27:17 -08:00
mmu_notifier.c mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
mmzone.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mprotect.c mm: add PTE_MARKER_GUARD PTE marker 2024-11-11 00:26:44 -08:00
mremap.c mm/mremap: remove goto from mremap_to() 2024-11-06 20:11:16 -08:00
mseal.c mm: madvise: implement lightweight guard page mechanism 2024-11-11 00:26:45 -08:00
msync.c
nommu.c mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling 2024-11-05 16:49:55 -08:00
numa_emulation.c mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
numa_memblks.c mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id 2024-10-28 21:40:40 -07:00
numa.c mm: make range-to-target_node lookup facility a part of numa_memblks 2024-09-03 21:15:32 -07:00
oom_kill.c mm: move mm flags to mm_types.h 2024-11-05 16:56:26 -08:00
page_alloc.c Merge branch 'mm-hotfixes-stable' into mm-stable 2024-11-11 00:04:10 -08:00
page_counter.c mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
page_ext.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
page_idle.c mm: page_idle: convert page idle to use a folio 2023-01-18 17:12:52 -08:00
page_io.c mm: add per-order mTHP swpin counters 2024-11-11 00:26:43 -08:00
page_isolation.c mm: remove migration for HugePage in isolate_single_pageblock() 2024-09-03 21:15:40 -07:00
page_owner.c mm/page-owner: use gfp_nested_mask() instead of open coded masking 2024-05-19 14:40:44 -07:00
page_poison.c mm/page_poison: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:50 -08:00
page_reporting.c mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
page_reporting.h
page_table_check.c mm/page_table_check: fix crash on ZONE_DEVICE 2024-06-15 10:43:04 -07:00
page_vma_mapped.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
page-writeback.c memcg-v1: no need for memcg locking for writeback tracking 2024-11-06 20:11:19 -08:00
pagewalk.c mm: pagewalk: add the ability to install PTEs 2024-11-11 00:26:44 -08:00
percpu-internal.h mm: remove CONFIG_MEMCG_KMEM 2024-07-10 12:14:54 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c percpu: clean up all mappings when pcpu_map_pages() fails 2024-04-25 20:55:49 -07:00
percpu.c mm: use page->private instead of page->index in percpu 2024-11-07 14:38:07 -08:00
pgalloc-track.h
pgtable-generic.c mm: pgtable: remove pte_offset_map_nolock() 2024-11-05 16:56:29 -08:00
process_vm_access.c mm: refactor mm_access() to not return NULL 2024-11-05 16:56:23 -08:00
ptdump.c mm: ptdump: add check_wx_pages debugfs attribute 2024-02-22 10:24:47 -08:00
readahead.c mm: don't set readahead flag on a folio when lookahead_size > nr_to_read 2024-11-06 20:11:15 -08:00
rmap.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
rodata_test.c
secretmem.c secretmem: disable memfd_secret() if arch cannot set direct map 2024-10-09 12:47:19 -07:00
shmem_quota.c shmem_quota: build the object file conditionally to the config option 2024-09-01 20:25:45 -07:00
shmem.c mm: shmem: override mTHP shmem default with a kernel parameter 2024-11-11 13:09:43 -08:00
show_mem.c mm/show_mem: use str_yes_no() helper in show_free_areas() 2024-11-07 14:38:08 -08:00
shrinker_debug.c mm: shrinker: use min() to improve shrinker_debugfs_scan_write() 2024-09-03 21:15:40 -07:00
shrinker.c mm: shrinker: avoid memleak in alloc_shrinker_info 2024-10-31 20:27:04 -07:00
shuffle.c
shuffle.h mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
slab_common.c mm: krealloc: Fix MTE false alarm in __do_krealloc 2024-10-29 10:40:53 +01:00
slab.h mm, slab: suppress warnings in test_leak_destroy kunit test 2024-10-02 16:28:46 +02:00
slub.c mm, slab: suppress warnings in test_leak_destroy kunit test 2024-10-02 16:28:46 +02:00
sparse-vmemmap.c mm: define general function pXd_init() 2024-11-11 17:22:27 -08:00
sparse.c bootmem: stop using page->index 2024-11-07 14:38:07 -08:00
swap_cgroup.c mm: attempt to batch free swap entries for zap_pte_range() 2024-09-03 21:15:33 -07:00
swap_slots.c mm: swap: update get_swap_pages() to take folio order 2024-04-25 20:56:37 -07:00
swap_state.c mm: swap: use str_true_false() helper function 2024-11-06 20:11:14 -08:00
swap.c mm: delete the unused put_pages_list() 2024-11-11 00:26:45 -08:00
swap.h mm: fix swap_read_folio_zeromap() for large folios with partial zeromap 2024-09-17 01:07:01 -07:00
swapfile.c mm, swap: avoid over reclaim of full clusters 2024-10-30 20:14:11 -07:00
truncate.c mm/truncate: reset xa_has_values flag on each iteration 2024-11-06 20:11:09 -08:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c mm: remove unused hugepage for vma_alloc_folio() 2024-11-06 20:11:12 -08:00
util.c mm: renovate page_address_in_vma() 2024-11-07 14:38:07 -08:00
vma_internal.h mm: isolate mmap internal logic to mm/vma.c 2024-11-06 20:11:19 -08:00
vma.c vma: detect infinite loop in vma tree 2024-11-11 13:09:42 -08:00
vma.h mm: isolate mmap internal logic to mm/vma.c 2024-11-06 20:11:19 -08:00
vmalloc.c alloc_tag: populate memory for module tags as needed 2024-11-07 14:25:16 -08:00
vmpressure.c eventfd: simplify eventfd_signal() 2023-11-28 14:08:38 +01:00
vmscan.c mm/vmscan: wake up flushers conditionally to avoid cgroup OOM 2024-11-07 14:38:07 -08:00
vmstat.c memcg/hugetlb: add hugeTLB counters to memcg 2024-11-14 22:49:19 -08:00
workingset.c mm/list_lru: simplify the list_lru walk callback function 2024-11-11 17:22:26 -08:00
z3fold.c mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool 2024-09-01 20:25:56 -07:00
zbud.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zpool.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zsmalloc.c mm/zsmalloc: use memcpy_from/to_page whereever possible 2024-11-07 14:38:07 -08:00
zswap.c mm/list_lru: simplify the list_lru walk callback function 2024-11-11 17:22:26 -08:00