linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2024-12-29 17:23:36 +00:00

David Hildenbrand 1bafe96e89 mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()

We want to limit the use of page_mapcount() to places where absolutely required, to prepare for kernel configs where we won't keep track of per-page mapcounts in large folios.

khugepaged is one of the remaining "more challenging" page_mapcount() users, but we might be able to move away from page_mapcount() without resulting in a significant behavior change that would warrant special-casing based on kernel configs.

In 2020, we first added support to khugepaged for collapsing COW-shared pages via commit `9445689f3b` ("khugepaged: allow to collapse a page shared across fork"), followed by support for collapsing PTE-mapped THP in commit `5503fbf2b0` ("khugepaged: allow to collapse PTE-mapped compound pages") and limiting the memory waste via the "page_count() > 1" check in commit `71a2c112a0` ("khugepaged: introduce 'max_ptes_shared' tunable").

As a default, khugepaged will allow up to half of the PTEs to map shared pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged setting.

khugepaged currently does not care about swapcache page references, and does not check under folio lock: so in some corner cases the "shared vs. exclusive" detection might be a bit off, making us detect "exclusive" when it's actually "shared".

Most of our anonymous folios in the system are usually exclusive. We frequently see sharing of anonymous folios for a short period of time, after which our short-lived subprocesses either quit or exec().

There are some famous examples, though, where child processes exist for a long time, and where memory is COW-shared with a lot of processes (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for reducing the memory footprint. We don't want to suddenly change the behavior to result in a significant increase in memory waste.

Interestingly, khugepaged will only collapse an anonymous THP if at least one PTE is writable. After fork(), that means that something (usually a page fault) populated at least a single exclusive anonymous THP in that PMD range.

So ... what happens when we switch to "is this folio mapped shared" instead of "is this page mapped shared" by using folio_likely_mapped_shared()?

For "not-COW-shared" folios, small folios and for THPs (large folios) that are completely mapped into at least one process, switching to folio_likely_mapped_shared() will not result in a change.

We'll only see a change for COW-shared PTE-mapped THPs that are partially mapped into all involved processes.

There are two cases to consider:

(A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP

If the folio is detected as exclusive, and it actually is exclusive, there is no change: page_mapcount() == 1. This is the common case without fork() or with short-lived child processes.

folio_likely_mapped_shared() might currently still detect a folio as exclusive although it is shared (false negatives): if the first page is not mapped multiple times and if the average per-page mapcount is smaller than 1, implying that (1) the folio is partially mapped and (2) if we are responsible for many mapcounts by mapping many pages others can't ("mostly exclusive") (3) if we are not responsible for many mapcounts by mapping few pages ("mostly shared") it won't make a big impact on the end result.

So while we might now detect a page as "exclusive" although it isn't, it's not expected to make a big difference in common cases.

(B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP

folio_likely_mapped_shared() will never detect a large anonymous folio as shared although it is exclusive: there are no false positives.

If we detect a THP as shared, at least one page of the THP is mapped by another process. It could well be that some pages are actually exclusive. For example, our child processes could have unmapped/COW'ed some pages such that they would now be exclusive to our process, which we now would treat as still-shared.

Examples:

(1) Parent maps all pages of a THP, child maps some pages. We detect all pages in the parent as shared although some are actually exclusive.

(2) Parent maps all but some page of a THP, child maps the remainder. We detect all pages of the THP that the parent maps as shared although they are all exclusive.

In (1) we wouldn't collapse a THP right now already: no PTE is writable, because a write fault would have resulted in COW of a single page and the parent would no longer map all pages of that THP.

For (2) we would have collapsed a THP in the parent so far, now we wouldn't as long as the child process is still alive: unless the child process unmaps the remaining THP pages or we decide to split that THP.

Possibly, the child COW'ed many pages, meaning that it's likely that we can populate a THP for our child first, and then for our parent.

For (2), we are making really bad use of the THP in the first place (not even mapped completely in at least one process). If the THP would be completely partially mapped, it would be on the deferred split queue where we would split it lazily later.

For short-running child processes, we don't particularly care. For long-running processes, the expectation is that such scenarios are rather rare: further, a THP might be best placed if most data in the PMD range is actually written, implying that we'll have to COW more pages first before khugepaged would collapse it.

To summarize, in the common case, this change is not expected to matter much. The more common application of khugepaged operates on exclusive pages, either before fork() or after a child quit.

Can we improve (A)? Yes, if we implement more precise tracking of "mapped shared" vs. "mapped exclusively", we could get rid of the false negatives completely.

Can we improve (B)? We could count how many pages of a large folio we map inside the current page table and detect that we are responsible for most of the folio mapcount and conclude "as good as exclusive", which might help in some cases. ... but likely, some other mechanism should detect that the THP is not a good use in the scenario (not even mapped completely in a single process) and try splitting that folio lazily etc.

We'll move the folio_test_anon() check before our "shared" check, so we might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order of checks now matches the one in __collapse_huge_page_isolate(). Extend documentation.

Link: https://lkml.kernel.org/r/20240424122630.495788-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-05-05 17:53:50 -07:00
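
The "shared vs. exclusive" guess described in cases (A) and (B) of the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct folio_model` and `model_likely_mapped_shared()` are hypothetical names invented for illustration, the model ignores PMD (entire) mappings, and it is not the kernel's implementation of folio_likely_mapped_shared().

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of a PTE-mapped folio. This is NOT a kernel
 * structure; every name in this sketch is an illustrative assumption.
 */
struct folio_model {
	int nr_pages;            /* number of pages in the folio */
	int total_mapcount;      /* sum of all per-page mapcounts */
	int first_page_mapcount; /* how many times the first page is mapped */
};

/* Guess whether the folio is mapped by more than one process. */
static bool model_likely_mapped_shared(const struct folio_model *f)
{
	/* At most one mapping in total: nothing can be shared. */
	if (f->total_mapcount <= 1)
		return false;

	/* Small folio: more than one mapping means the page is shared. */
	if (f->nr_pages == 1)
		return true;

	/* More mappings than pages: some page must be mapped more than once. */
	if (f->total_mapcount > f->nr_pages)
		return true;

	/*
	 * Otherwise guess from the first page only. A shared folio whose
	 * first page is mapped at most once is reported as exclusive here:
	 * the false negative described in case (A).
	 */
	return f->first_page_mapcount > 1;
}
```

In this model a "true" result always means that some page is mapped more than once, which mirrors the "no false positives" property of case (B), while a partially mapped shared folio can still be reported as exclusive.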
..
ABI	mm: add docs for per-order mTHP counters and transhuge_page ABI	2024-05-05 17:53:36 -07:00
accel	docs/accel: correct links to mailing list archives	2024-01-23 14:45:50 -07:00
accounting
admin-guide	mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()	2024-05-05 17:53:50 -07:00
arch	Documentation/x86: Fix title underline length	2024-03-25 11:29:16 +01:00
block	Documentation: block: ioprio: Update schedulers	2024-01-18 08:21:14 -07:00
bpf	bpf, docs: Rename legacy conformance group to packet	2024-03-04 14:31:06 +01:00
cdrom
core-api	workqueue: Changes for v6.9	2024-03-11 12:50:42 -07:00
cpu-freq
crypto
dev-tools	Documentation: dev-tools: Add link to RV docs	2024-03-29 08:27:21 -06:00
devicetree	Merge tag 'drm-msm-next-2024-04-11' of https://gitlab.freedesktop.org/drm/msm into drm-fixes	2024-04-12 11:01:45 +10:00
doc-guide	docs: drop the version constraints for sphinx and dependencies	2024-03-03 08:17:20 -07:00
driver-api	mm: zswap: remove same_filled module params	2024-05-05 17:53:38 -07:00
fault-injection	Fixed case issue with 'fault-injection' in documentation	2024-02-21 13:44:21 -07:00
fb
features	membarrier: riscv: Provide core serializing command	2024-02-15 08:04:14 -08:00
filesystems	doc: split buffer.rst out of api-summary.rst	2024-05-05 17:53:40 -07:00
firmware_class
firmware-guide	More ACPI updates for 6.9-rc1	2024-03-19 11:15:14 -07:00
fpga
gpu	drm-misc-next for v6.9:	2024-02-26 09:51:49 +01:00
hid
hwmon	hwmon: (aspeed-g6-pwm-tacho): Support for ASPEED g6 PWM/Fan tach	2024-03-07 10:50:16 -08:00
i2c	Documentation: i2c: Document that client auto-detection is a legacy mechanism	2024-03-07 09:42:09 +01:00
iio	docs: iio: add documentation for adis16475 driver	2024-02-28 19:26:36 +00:00
images
infiniband
input
isdn
kbuild	Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer	2024-03-31 21:09:50 +09:00
kernel-hacking
leds
litmus-tests
livepatch
locking
maintainer	docs: maintainer: add existing SoC and netdev profiles	2024-02-05 10:05:57 -07:00
mhi
misc-devices	dt-bindings: misc: xlnx,sd-fec: convert bindings to yaml	2024-02-04 19:49:51 -06:00
mm	mm/page_table_check: support userfault wr-protect entries	2024-05-05 17:53:41 -07:00
netlabel
netlink	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	2024-03-11 20:38:36 -07:00
networking	Documentation: Add documentation for eswitch attribute	2024-03-28 18:20:08 -07:00
nvdimm
nvme
PCI
pcmcia
peci
power	Documentation: power: Fix typo in suspend and interrupts doc	2024-03-13 20:51:11 +01:00
process	A handful of late-arriving documentation fixes and enhancements.	2024-03-20 09:36:46 -07:00
RCU	A moderately busy cycle for development this time around.	2024-03-12 15:18:34 -07:00
rust	arm64 updates for 6.9:	2024-03-14 15:35:42 -07:00
scheduler	A single update for the documentation of the base_slice_ns tunable to	2024-03-24 11:11:05 -07:00
scsi
security
sound	ALSA: doc: Use DEFINE_SIMPLE_DEV_PM_OPS()	2024-02-12 11:50:26 +01:00
sphinx	docs: drop the version constraints for sphinx and dependencies	2024-03-03 08:17:20 -07:00
sphinx-static
spi	spi: docs: spidev: fix echo command format	2024-03-19 18:37:55 +00:00
staging	docs: staging: fix typo in docs	2024-02-08 15:38:21 -07:00
target
tee
timers
tools	tools/rtla: Add -U/--user-load option to timerlat	2024-03-20 05:39:06 +01:00
trace	tracing/user_events: Document multi-format flag	2024-03-18 10:13:16 -04:00
translations	remove references to page->flags in documentation	2024-04-25 20:56:15 -07:00
usb	Documentation: usb: Document FunctionFS DMABUF API	2024-02-17 17:00:09 +01:00
userspace-api	media updates for v6.9-rc1	2024-03-15 11:36:54 -07:00
virt	Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP	2024-03-18 19:03:53 -04:00
w1	w1: add UART w1 bus driver	2024-02-15 15:02:33 +01:00
watchdog
wmi	platform/x86: wmi: Update documentation regarding _WED	2024-02-27 14:44:31 +02:00
.gitignore
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py	docs: Restore "smart quotes" for quotes	2024-02-28 15:48:18 -07:00
docutils.conf
dontdiff
index.rst	A moderately busy cycle for development this time around.	2024-03-12 15:18:34 -07:00
Kconfig
Makefile	docs: Makefile: Add dependency to $(YNL_INDEX) for targets other than htmldocs	2024-03-05 11:06:43 -07:00
memory-barriers.txt
SubmittingPatches
subsystem-apis.rst	docs: Fix subsystem APIs page so ungrouped entries have their own header	2024-01-30 14:02:32 -07:00