2b6198eaf6
rw_semaphore is a sizable structure of 40 bytes and consumes considerable space in each vm_area_struct. However, vma_lock has two important specifics which allow rw_semaphore to be replaced with a simpler structure:

1. Readers never wait. They try to take the vma_lock and fall back to mmap_lock if that fails.

2. Only one writer at a time will ever try to write-lock a vma_lock, because writers first take mmap_lock in write mode.

Because of these requirements, full rw_semaphore functionality is not needed and we can replace rw_semaphore and the vma->detached flag with a refcount (vm_refcnt).

When a vma is in detached state, vm_refcnt is 0 and only a call to vma_mark_attached() can take it out of this state. Note that unlike before, we now enforce that both vma_mark_attached() and vma_mark_detached() are done only after the vma has been write-locked. vma_mark_attached() changes vm_refcnt to 1 to indicate that the vma has been attached to the vma tree.

When a reader takes the read lock, it increments vm_refcnt, unless the top usable bit of vm_refcnt (0x40000000) is set, indicating the presence of a writer. When a writer takes the write lock, it sets the top usable bit to indicate its presence. If there are readers, the writer will wait using the newly introduced mm->vma_writer_wait. Since all writers take mmap_lock in write mode first, there can be only one writer at a time. The last reader to release the lock will signal the writer to wake up. The refcount might overflow if there are many competing readers, in which case read-locking will fail. Readers are expected to handle such failures.

In summary:

1. All readers increment the vm_refcnt.

2. The writer sets the top usable (writer) bit of vm_refcnt.

3. Readers cannot increment the vm_refcnt if the writer bit is set.

4. In the presence of readers, the writer must wait for the vm_refcnt to drop to 1 (ignoring the writer bit), indicating an attached vma with no readers.

5. vm_refcnt overflow is handled by the readers.

An illustrative sketch of this protocol follows the tags below.

While this vm_lock replacement does not yet result in a smaller vm_area_struct (it stays at 256 bytes due to cacheline alignment), it allows for further size optimization by structure member regrouping to bring the size of vm_area_struct below 192 bytes.

Link: https://lkml.kernel.org/r/20250111042604.3230628-12-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: kernel test robot <oliver.sang@intel.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
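
For illustration, here is a minimal userspace sketch of the scheme described above. The name vm_refcnt and the 0x40000000 writer bit come from the commit text; the waitqueue (mm->vma_writer_wait) is modeled with a pthread condition variable and the kernel's refcount_t with C11 atomics, so this is a simplified model of the protocol, not the actual kernel implementation.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <pthread.h>

  #define VMA_LOCK_OFFSET 0x40000000  /* top usable bit: writer present */

  struct vma_model {
          atomic_int vm_refcnt;       /* 0 = detached, 1 = attached, >1 = readers */
          pthread_mutex_t wait_mutex; /* stands in for mm->vma_writer_wait */
          pthread_cond_t  writer_wait;
  };

  static void vma_model_init(struct vma_model *vma)
  {
          atomic_init(&vma->vm_refcnt, 0);    /* vma starts out detached */
          pthread_mutex_init(&vma->wait_mutex, NULL);
          pthread_cond_init(&vma->writer_wait, NULL);
  }

  /* Attach: only valid on a detached (vm_refcnt == 0), write-locked vma. */
  static void vma_mark_attached(struct vma_model *vma)
  {
          atomic_store(&vma->vm_refcnt, 1);
  }

  /* Reader: one attempt only; on failure the caller falls back to mmap_lock. */
  static bool vma_read_trylock(struct vma_model *vma)
  {
          int cnt = atomic_load(&vma->vm_refcnt);

          for (;;) {
                  /*
                   * Fail if the vma is detached (0), a writer is present, or
                   * incrementing would overflow into the writer bit.
                   */
                  if (cnt == 0 || (cnt & VMA_LOCK_OFFSET) ||
                      cnt == VMA_LOCK_OFFSET - 1)
                          return false;
                  /* On failure, cnt is refreshed and we retry the check. */
                  if (atomic_compare_exchange_weak(&vma->vm_refcnt, &cnt,
                                                   cnt + 1))
                          return true;
          }
  }

  static void vma_read_unlock(struct vma_model *vma)
  {
          int old = atomic_fetch_sub(&vma->vm_refcnt, 1);

          /* If a writer is waiting and we were the last reader, wake it. */
          if (old == (VMA_LOCK_OFFSET + 2)) {
                  pthread_mutex_lock(&vma->wait_mutex);
                  pthread_cond_signal(&vma->writer_wait);
                  pthread_mutex_unlock(&vma->wait_mutex);
          }
  }

  /*
   * Writer: mmap_lock held in write mode guarantees a single writer, so
   * setting the writer bit cannot race with another writer.
   */
  static void vma_write_lock(struct vma_model *vma)
  {
          atomic_fetch_or(&vma->vm_refcnt, VMA_LOCK_OFFSET);

          pthread_mutex_lock(&vma->wait_mutex);
          /* Wait until only the attach reference remains beside the bit. */
          while ((atomic_load(&vma->vm_refcnt) & ~VMA_LOCK_OFFSET) != 1)
                  pthread_cond_wait(&vma->writer_wait, &vma->wait_mutex);
          pthread_mutex_unlock(&vma->wait_mutex);
  }

  static void vma_write_unlock(struct vma_model *vma)
  {
          atomic_fetch_and(&vma->vm_refcnt, ~VMA_LOCK_OFFSET);
  }

A reader that gets false from vma_read_trylock() simply takes mmap_lock instead, mirroring requirement 1 above; because readers never block on the vma lock, no reader-side waitqueue is needed, which is what makes the refcount sufficient.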