linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2025-01-08 15:04:45 +00:00

The linux-next integration testing tree

Suren Baghdasaryan 0b6cc04f3d mm: introduce CONFIG_PER_VMA_LOCK		2023-04-05 20:02:56 -07:00

Patch series "Per-VMA locks", v4.

LWN article describing the feature: https://lwn.net/Articles/906852/

The per-VMA locks idea was discussed during the SPF [1] discussion at LSF/MM last year [2], which concluded with the suggestion that “a reader/writer semaphore could be put into the VMA itself; that would have the effect of using the VMA as a sort of range lock. There would still be contention at the VMA level, but it would be an improvement.” This patchset implements this suggested approach.

When handling page faults we look up the VMA that contains the faulting page under RCU protection and try to acquire its lock. If that fails we fall back to using mmap_lock, similar to how SPF handled this situation.

One notable way the implementation deviates from the proposal is the way VMAs are read-locked. During some mm updates, multiple VMAs need to be locked until the end of the update (e.g. vma_merge, split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks, and figuring out when it's safe to unlock previously locked VMAs would make the code more complex. So, instead of the usual lock/unlock pattern, the proposed solution marks a VMA as locked and provides an efficient way to:

1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.

We also postpone unlocking the locked VMAs until the end of the update, when we do mmap_write_unlock. Potentially this keeps a VMA locked for longer than is absolutely necessary but it results in a big reduction of code complexity.

Read-locking a VMA is done using two sequence numbers - one in the vm_area_struct and one in the mm_struct. A VMA is considered read-locked when these sequence numbers are equal. To read-lock a VMA we set the sequence number in vm_area_struct to be equal to the sequence number in mm_struct. To unlock all VMAs we increment mm_struct's seq number. This allows for an efficient way to track locked VMAs and to drop the locks on all VMAs at the end of the update.

The patchset implements per-VMA locking only for anonymous pages which are not in swap and avoids userfaultfs as their implementation is more complex. Additional support for file-backed page faults, swapped and user pages can be added incrementally.

Performance benchmarks show benefits similar to, although slightly smaller than, those of the SPF patchset (~75% of SPF benefits). Still, with lower complexity this approach might be more desirable.

Since the RFC was posted in September 2022, two separate Google teams outside of Android evaluated the patchset and confirmed positive results. Here are the known use cases where per-VMA locks show benefits:

Android:

Launch times of apps with a high number of threads (~100) improve by up to 20%. Each thread mmaps several areas upon startup (Stack and Thread-local storage (TLS), thread signal stack, indirect ref table), which requires taking mmap_lock in write mode. Page faults take mmap_lock in read mode. During app launch, both thread creation and page faults establishing the active working set are happening in parallel and that causes lock contention between mm writers and readers even if updates and page faults are happening in different VMAs. Per-VMA locks prevent this contention by providing a more granular lock.

Google Fibers:

We have several dynamically sized thread pools that spawn new threads under increased load and reduce their number when idling. For example, Google's in-process scheduling/threading framework, UMCG/Fibers, is backed by such a thread pool. When idling, only a small number of idle worker threads are available; when a spike of incoming requests arrives, each request is handled in its own "fiber", which is a work item posted onto a UMCG worker thread; quite often these spikes lead to a number of new threads spawning. Each new thread needs to allocate and register an RSEQ section on its TLS, then register itself with the kernel as a UMCG worker thread, and only after that can it be considered by the in-process UMCG/Fiber scheduler as available to do useful work. In short, during an incoming workload spike new threads have to be spawned, and they perform several syscalls (RSEQ registration, UMCG worker registration, memory allocations) before they can actually start doing useful work. Removing any bottlenecks on this thread startup path will greatly improve our services' latencies when faced with request/workload spikes.

At high scale, mmap_lock contention during thread creation and stack page faults leads to user-visible multi-second serving latencies in a similar pattern to Android app startup. The per-VMA locking patchset has been run successfully in limited experiments with user-facing production workloads. In these experiments, we observed that the peak thread creation rate was high enough that thread creation is no longer a bottleneck.

TCP zerocopy receive:

From the point of view of TCP zerocopy receive, the per-VMA lock patch is massively beneficial. In today's implementation, a process with N threads where N - 1 are performing zerocopy receive and 1 thread is performing madvise() with the write lock taken (e.g. needs to change vm_flags) will result in all N - 1 receive threads blocking until the madvise is done. Conversely, on a busy process receiving a lot of data, an madvise operation that does need to take the mmap lock in write mode will need to wait for all of the receives to be done - a lose-lose proposition. Per-VMA locking _removes_ by definition this source of contention entirely.

There are other benefits for receive as well, chiefly a reduction in cacheline bouncing across receiving threads for locking/unlocking the single mmap lock. On an RPC style synthetic workload with 4KB RPCs:

1a) The find+lock+unlock VMA path in the base case, without the per-VMA lock patchset, is about 0.7% of cycles as measured by perf.
1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5% of cycles overall - most of this is within the TCP read hotpath (a small fraction is 'other' usage in the system).
2a) The find+lock+unlock VMA path, with the per-VMA patchset and a trivial patch written to take advantage of it in TCP, is about 0.4% of cycles (down from 0.7% above).
2b) mmap_read_lock + mmap_read_unlock in the per-VMA patchset is < 0.1% of cycles and is out of the TCP read hotpath entirely (down from 0.5% before, the remaining usage is the 'other' usage in the system).

So, in addition to entirely removing an onerous source of contention, it also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+ (compared to overall cycles in perf) for the 'small' RPC scenario.

In https://lkml.kernel.org/r/87fsaqouyd.fsf_-_@stealth, Punit demonstrated throughput improvements of as much as 188% from this patchset.

This patch (of 25):

This configuration variable will be used to build the support for VMA locking during page fault handling.

This is enabled on supported architectures with SMP and MMU set.

The architecture support is needed since the page fault handler is called from the architecture's page faulting code which needs modifications to handle faults under VMA lock.

Link: https://lkml.kernel.org/r/20230227173632.3292573-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230227173632.3292573-10-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
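
The two-sequence-number scheme described in the commit message above can be illustrated with a small user-space sketch. This is not the kernel's code: the toy_mm and toy_vma structs, the lock_seq field name, and every toy_* function are hypothetical stand-ins, and the sketch ignores concurrency and sequence-number wraparound. It only shows why marking a VMA is a single store and why one increment unlocks every marked VMA.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for mm_struct and vm_area_struct. */
struct toy_mm {
    unsigned int lock_seq;   /* incremented to drop all VMA marks at once */
};

struct toy_vma {
    struct toy_mm *mm;
    unsigned int lock_seq;   /* VMA counts as locked while this equals mm->lock_seq */
};

/* Start with a value that differs from the mm's, so the VMA begins unlocked. */
static void toy_vma_init(struct toy_vma *vma, struct toy_mm *mm)
{
    vma->mm = mm;
    vma->lock_seq = mm->lock_seq - 1;
}

/* 1. Identify locked VMAs: compare the two sequence numbers. */
static bool toy_vma_is_locked(const struct toy_vma *vma)
{
    return vma->lock_seq == vma->mm->lock_seq;
}

/* Lock: copy the mm's sequence number into the VMA. Marking twice is harmless. */
static void toy_vma_mark_locked(struct toy_vma *vma)
{
    vma->lock_seq = vma->mm->lock_seq;
}

/* 2. Unlock in bulk: one increment invalidates every mark made so far. */
static void toy_mm_unlock_all(struct toy_mm *mm)
{
    mm->lock_seq++;
}

/*
 * Simulates one update that touches two of three VMAs and returns how many
 * VMAs are still marked locked after the bulk unlock.
 */
static int toy_demo(void)
{
    struct toy_mm mm = { .lock_seq = 1 };
    struct toy_vma vmas[3];
    int i, still_locked = 0;

    for (i = 0; i < 3; i++)
        toy_vma_init(&vmas[i], &mm);

    toy_vma_mark_locked(&vmas[0]);
    toy_vma_mark_locked(&vmas[1]);
    toy_vma_mark_locked(&vmas[1]);   /* repeated marking needs no tracking */
    assert(toy_vma_is_locked(&vmas[0]));
    assert(toy_vma_is_locked(&vmas[1]));
    assert(!toy_vma_is_locked(&vmas[2]));

    toy_mm_unlock_all(&mm);          /* stands in for the end of the update */
    for (i = 0; i < 3; i++)
        still_locked += toy_vma_is_locked(&vmas[i]);
    return still_locked;
}
```

Because the unlock step never walks a list of locked VMAs, the updater does not have to remember which VMAs it marked, which is the complexity reduction the commit message argues for.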
arch	mips: fix comment about pgtable_init()	2023-04-05 19:42:52 -07:00
block	block: remove obsolete config BLOCK_COMPAT	2023-03-16 09:35:44 -06:00
certs	Kbuild updates for v6.3	2023-02-26 11:53:25 -08:00
crypto	asymmetric_keys: log on fatal failures in PE/pkcs7	2023-03-21 16:23:56 +00:00
Documentation	mm: hold the RCU read lock over calls to ->map_pages	2023-04-05 19:43:00 -07:00
drivers	drm/ttm: remove comment referencing now-removed vmf_insert_mixed_prot()	2023-04-05 19:42:56 -07:00
fs	afs: split afs_pagecache_valid() out of afs_validate()	2023-04-05 19:43:00 -07:00
include	trace: cma: remove unnecessary event class cma_alloc_class	2023-04-05 19:42:58 -07:00
init	init,mm: fold late call to page_ext_init() to page_alloc_init_late()	2023-04-05 19:42:54 -07:00
io_uring	block-6.3-2023-03-24	2023-03-24 14:10:39 -07:00
ipc	Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2023-02-24 19:20:07 -08:00
kernel	mm, treewide: redefine MAX_ORDER sanely	2023-04-05 19:42:46 -07:00
lib	iov_iter: add copy_page_to_iter_nofault()	2023-04-05 19:42:57 -07:00
LICENSES	LICENSES: Add the copyleft-next-0.3.1 license	2022-11-08 15:44:01 +01:00
mm	mm: introduce CONFIG_PER_VMA_LOCK	2023-04-05 20:02:56 -07:00
net	mm, treewide: redefine MAX_ORDER sanely	2023-04-05 19:42:46 -07:00
rust	Rust fixes for 6.3-rc1	2023-03-03 14:51:15 -08:00
samples	LoongArch changes for v6.3	2023-03-01 09:27:00 -08:00
scripts	checksyscalls: ignore fstat to silence build warning on LoongArch	2023-03-23 17:18:32 -07:00
security	mm, treewide: redefine MAX_ORDER sanely	2023-04-05 19:42:46 -07:00
sound	ALSA: hda/ca0132: fixup buffer overrun at tuning_ctl_set()	2023-03-14 17:04:53 +01:00
tools	selftests/mm: set overcommit_policy as OVERCOMMIT_ALWAYS	2023-04-05 19:42:59 -07:00
usr	usr/gen_init_cpio.c: remove unnecessary -1 values from int file	2022-10-03 14:21:44 -07:00
virt	KVM/riscv changes for 6.3	2023-02-15 12:33:28 -05:00
.clang-format	cpumask: re-introduce constant-sized cpumask optimizations	2023-03-05 14:30:34 -08:00
.cocciconfig	scripts: add Linux .cocciconfig for coccinelle	2016-07-22 12:13:39 +02:00
.get_maintainer.ignore	get_maintainer: add Alan to .get_maintainer.ignore	2022-08-20 15:17:44 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for *.dtso files	2023-02-26 15:28:23 +09:00
.gitignore	kbuild: rpm-pkg: move source components to rpmbuild/SOURCES	2023-03-16 22:45:56 +09:00
.mailmap	mailmap: add an entry for Leonard Crestez	2023-03-28 15:24:32 -07:00
.rustfmt.toml	rust: add `.rustfmt.toml`	2022-09-28 09:02:20 +02:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	There is no particular theme here - mainly quick hits all over the tree.	2023-02-23 17:55:40 -08:00
Kbuild	Kbuild updates for v6.1	2022-10-10 12:00:45 -07:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	MAINTAINERS: extend memblock entry to include MM initialization	2023-04-05 19:42:55 -07:00
Makefile	Linux 6.3-rc4	2023-03-26 14:40:20 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.