linux-next/Documentation
Qi Zheng 1f07a84d2e mm: khugepaged: recheck pmd state in retract_page_tables()
Patch series "synchronously scan and reclaim empty user PTE pages", v4.

Previously, we tried to use a completely asynchronous method to reclaim
empty user PTE pages [1].  After discussing with David Hildenbrand, we
decided to implement synchronous reclamation in the case of
madvise(MADV_DONTNEED) as the first step.

So this series aims to synchronously free the empty PTE pages in the
madvise(MADV_DONTNEED) case.  We will detect and free empty PTE pages in
zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases
other than madvise(MADV_DONTNEED).
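
The opt-in idea above can be modeled in user space. The following is only a
minimal sketch under stated assumptions, not the series' actual code: every
name (toy_zap_details, toy_pmd, toy_zap_range, TOY_ENTRIES) is hypothetical,
and the full-table scan is a simplification chosen for clarity. It shows a
zap routine that frees an emptied table only when the caller set a
reclaim_pt-style flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative table size; 512 is the number of PTEs per page on x86_64. */
#define TOY_ENTRIES 512

/* Hypothetical analogue of zap_details carrying only the new flag. */
struct toy_zap_details {
	bool reclaim_pt;	/* set only by the MADV_DONTNEED-like caller */
};

/* Hypothetical analogue of a pmd entry: points to a table, or NULL. */
struct toy_pmd {
	uint64_t *table;
};

/* Return true if every entry in the table is clear. */
static bool toy_table_empty(const uint64_t *table)
{
	for (size_t i = 0; i < TOY_ENTRIES; i++)
		if (table[i] != 0)
			return false;
	return true;
}

/*
 * Clear entries [start, end). If the caller opted in via reclaim_pt and
 * the table ended up empty, free the table and clear the toy pmd.
 * Returns true only if the table was freed.
 */
static bool toy_zap_range(struct toy_pmd *pmd, size_t start, size_t end,
			  const struct toy_zap_details *details)
{
	assert(pmd != NULL);

	if (!pmd->table || start > end || end > TOY_ENTRIES)
		return false;

	for (size_t i = start; i < end; i++)
		pmd->table[i] = 0;

	/* Callers that did not opt in never trigger table reclaim. */
	if (!details || !details->reclaim_pt)
		return false;
	if (!toy_table_empty(pmd->table))
		return false;

	free(pmd->table);
	pmd->table = NULL;
	return true;
}
```

In this model, a caller that passes NULL details or reclaim_pt == false
leaves an empty table in place, which mirrors how the flag is meant to
exclude paths other than madvise(MADV_DONTNEED).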

In zap_pte_range(), mmu_gather is used to perform batched TLB flushing and
page freeing operations.  Therefore, if we want to free the empty PTE page
in this path, the most natural way is to add it to mmu_gather as well.
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free
page table pages in a semi-RCU manner:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

But this is not enough to free the empty PTE page table pages in paths
other than the munmap and exit_mmap paths, because an IPI cannot be
synchronized with rcu_read_lock() in pte_offset_map{_lock}().  So we should
let a single table also be freed by RCU, like batch table freeing.

As a first step, we supported this feature on x86_64 and selected the newly
introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.

For other cases such as madvise(MADV_FREE), we may consider scanning and
freeing empty PTE pages asynchronously in the future.

Note: issues related to TLB flushing are not new to this series and are tracked
      in the separate RFC patch [3]. For more context, please refer to this
      thread [4].

[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
[4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/


This patch (of 12):

In retract_page_tables(), the lock of new_folio is still held, so the page
fault path will be blocked, which prevents the pte entries from being set
again.  So even though the old empty PTE page may be concurrently freed and
a new PTE page is filled into the pmd entry, it is still empty and can be
removed.

So just refactor retract_page_tables() slightly and recheck the pmd state
after taking the pmd lock.
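
The recheck-after-lock pattern can be illustrated with a small user-space
analogy. This is a minimal sketch under stated assumptions, not the kernel
code: struct slot, enum slot_state, retract_slot() and the race_hook
callback are all hypothetical names, a pthread mutex stands in for the pmd
lock, and the hook exists only to simulate a concurrent change in a single
thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state of a pmd entry; not a kernel type. */
enum slot_state { SLOT_NONE, SLOT_TABLE, SLOT_HUGE };

struct slot {
	pthread_mutex_t lock;	/* plays the role of the pmd lock */
	enum slot_state state;
};

/*
 * Optional hook run between the unlocked check and taking the lock.
 * It only simulates a concurrent modification for demonstration.
 */
typedef void (*race_hook)(struct slot *);

static bool slot_points_to_table(const struct slot *s)
{
	return s->state == SLOT_TABLE;
}

/* Example hook: pretend another thread installed a huge mapping. */
static void make_huge(struct slot *s)
{
	s->state = SLOT_HUGE;
}

/* Returns true if the table was retracted (the slot was cleared). */
static bool retract_slot(struct slot *s, race_hook hook)
{
	bool retracted = false;

	assert(s != NULL);

	/*
	 * Cheap unlocked check to skip slots that clearly do not qualify.
	 * A truly concurrent version would need an atomic load here.
	 */
	if (!slot_points_to_table(s))
		return false;

	if (hook)
		hook(s);

	pthread_mutex_lock(&s->lock);
	/* Recheck under the lock: the state may have changed meanwhile. */
	if (slot_points_to_table(s)) {
		s->state = SLOT_NONE;
		retracted = true;
	}
	pthread_mutex_unlock(&s->lock);

	return retracted;
}
```

Calling retract_slot(&s, NULL) on a slot in SLOT_TABLE state clears it,
while passing make_huge as the hook makes the locked recheck fail and
leaves the slot untouched. The sketch only checks that the state still
qualifies, not that the entry is unchanged, which matches the reasoning
above that a concurrently replaced PTE page is still empty.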

Link: https://lkml.kernel.org/r/cover.1733305182.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/70a51804cd19d44ccaf031825d9fb6eaf92f2bad.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Suggested-by: Jann Horn <jannh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:50:45 -08:00
ABI linux-watchdog 6.13-rc1 tag 2024-12-05 10:03:43 -08:00
accel accel/qaic: Add AIC080 support 2024-10-12 14:51:04 -06:00
accounting
admin-guide docs: tmpfs: drop 'fadvise()' from the documentation 2024-12-18 19:50:32 -08:00
arch ACPI/IORT: Add PMCG platform information for HiSilicon HIP09A 2024-12-05 11:24:18 +00:00
block Documentation: ublk: document UBLK_F_USER_RECOVERY_FAIL_IO 2024-10-22 08:16:40 -06:00
bpf bpf: Remove trailing whitespace in verifier.rst 2024-11-11 08:17:48 -08:00
cdrom
core-api module: Convert default symbol namespace to string literal 2024-12-03 08:22:25 -08:00
cpu-freq
crypto crypto: doc - Fix akcipher title reference 2024-10-10 17:08:02 +08:00
dev-tools Kbuild updates for v6.13 2024-11-30 13:41:50 -08:00
devicetree USB driver fixes for 6.13-rc3 2024-12-14 09:35:22 -08:00
doc-guide Documentation: kernel-doc: enumerate identifier *type*s 2024-11-22 10:37:40 -07:00
driver-api Driver core changes for 6.13-rc1 2024-11-29 11:43:29 -08:00
fault-injection net: Implement fault injection forcing skb reallocation 2024-11-12 12:05:33 +01:00
fb
features riscv: Add qspinlock support 2024-11-11 07:33:20 -08:00
filesystems vfs-6.13-rc1.fixes 2024-11-27 08:11:46 -08:00
firmware_class
firmware-guide
fpga
gpu drm/amdgpu: Add documentation for enforce isolation feature 2024-11-08 11:45:29 -05:00
hid
hwmon hwmon: (tmp108) Add NXP p3t1085 support 2024-11-12 13:54:55 -08:00
i2c i2c-host updates for v6.13, part 1 2024-11-18 08:35:47 +01:00
iio docs: iio: ad7380: add adaq4370-4 and adaq4380-4 2024-11-09 10:42:03 +00:00
images
infiniband
input Input: fix the input_event struct documentation 2024-11-14 18:03:23 -08:00
isdn
kbuild Kbuild updates for v6.13 2024-11-30 13:41:50 -08:00
kernel-hacking docs/licensing: Clarify wording about "GPL" and "Proprietary" 2024-11-22 10:44:25 -07:00
leds - Limited LED current based on thermal conditions in the QCOM flash LED driver. 2024-09-23 14:20:11 -07:00
litmus-tests
livepatch
locking locking/Documentation: Fix grammar in percpu-rw-semaphore.rst 2024-11-13 10:59:01 +01:00
maintainer docs: Remove redundant word "for" 2024-10-21 09:32:20 -06:00
mhi
misc-devices
mm mm: khugepaged: recheck pmd state in retract_page_tables() 2024-12-18 19:50:45 -08:00
netlabel
netlink net: Add napi_struct parameter irq_suspend_timeout 2024-11-11 18:45:05 -08:00
networking Documentation: networking: Add a caveat to nexthop_compat_mode sysctl 2024-12-10 18:26:24 -08:00
nvdimm
nvme
PCI Merge branch 'pci/endpoint' 2024-11-25 13:40:56 -06:00
pcmcia
peci
power Documentation: PM: Clarify pm_runtime_resume_and_get() return value 2024-12-10 20:14:22 +01:00
process A few late-arriving fixes, plus two more significant changes that were 2024-11-26 13:44:27 -08:00
RCU doc: rcu: update printed dynticks counter bits 2024-11-12 21:40:24 +01:00
rust Rust changes for v6.13 2024-11-26 14:00:26 -08:00
scheduler sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local() 2024-11-11 07:06:16 -10:00
scsi
security landlock: Fix grammar issues in documentation 2024-10-21 20:36:26 +02:00
sound ASoC: Fixes for v6.13 2024-11-28 14:55:21 +01:00
sphinx
sphinx-static
spi
staging Documentation: Fix incorrect paths/magic in magic numbers rst 2024-11-04 12:34:59 -07:00
target
tee
timers timers/Documentation: Cleanup delay/sleep documentation 2024-10-16 00:36:48 +02:00
tools rtla: Documentation: Mention --deepest-idle-state 2024-10-17 17:13:16 -04:00
trace tracing: Record task flag NEED_RESCHED_LAZY. 2024-11-22 17:49:39 -05:00
translations module: Convert default symbol namespace to string literal 2024-12-03 08:22:25 -08:00
usb
userspace-api drm for 6.13-rc1 2024-11-21 14:56:17 -08:00
virt The biggest change here is eliminating the awful idea that KVM had, of 2024-11-23 16:00:50 -08:00
w1
watchdog watchdog: Delete the cpu5wdt driver 2024-11-05 10:04:39 +01:00
wmi platform-drivers-x86 for v6.13-1 2024-11-20 14:07:55 -08:00
.gitignore
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py
docutils.conf
index.rst
Kconfig
Makefile
memory-barriers.txt
SubmittingPatches
subsystem-apis.rst