sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.

However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
  track of which mm_cid value was last used, and use it as a hint to
  attempt re-allocating the same concurrency ID the next time this
  mm/cpu needs to allocate a concurrency ID,

- Add a per-mm CPUs allowed mask, which keeps track of the union of
  CPUs allowed for all threads belonging to this mm. This cpumask is
  only set during the lifetime of the mm, never cleared, so it
  represents the union of all the CPUs allowed since the beginning of
  the mm lifetime (note that the mm_cpumask() is really arch-specific
  and tailored to the TLB flush needs, and is thus _not_ a viable
  approach for this),

- Add a per-mm nr_cpus_allowed to keep track of the weight of the
  per-mm CPUs allowed mask (for fast access),

- Add a per-mm max_nr_cid to keep track of the highest number of
  concurrency IDs allocated for the mm. This is used for expanding the
  concurrency ID allocation within the upper bound defined by:

    min(mm->nr_cpus_allowed, mm->mm_users)

  When the next unused CID value reaches this threshold, stop trying
  to expand the cid allocation and use the first available cid value
  instead.

  Spreading allocation to use all the cid values within the range

    [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

  improves cache locality while preserving mm_cid compactness within the
  expected user limits,

- In __mm_cid_try_get, only return cid values within the range
  [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
  prevents allocating cids above the number of allowed cpus in
  rare scenarios where cid allocation races with a concurrent
  remote-clear of the per-mm/cpu cid. This improvement is made
  possible by the addition of the per-mm CPUs allowed mask,

- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
  t->nr_cpus_allowed. This criterion was really meant to compare
  the number of mm->mm_users to the number of CPUs allowed for the
  entire mm. Therefore, the prior comparison worked fine when all
  threads shared the same CPUs allowed mask, but not so much in
  scenarios where those threads have different masks (e.g. each
  thread pinned to a single CPU). This improvement is made
  possible by the addition of the per-mm CPUs allowed mask.
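
Below is a simplified, self-contained userspace model of the resulting
allocation policy: prefer the per-cpu recent_cid hint, otherwise expand
the allocation while the next cid stays below
min(mm->nr_cpus_allowed, mm->mm_users), otherwise fall back to the first
free cid within that bound. This is only a sketch, not the kernel
implementation: struct mm_model, cid_try_get() and first_zero_bit() are
illustrative names, and the real code uses per-cpu data, cpumasks and
atomic operations.

  #include <stdio.h>

  #define MAX_CIDS 64

  struct mm_model {
          unsigned long long cid_bitmap;  /* bit set = cid currently in use */
          unsigned int nr_cpus_allowed;   /* weight of per-mm allowed mask */
          unsigned int mm_users;          /* number of threads using the mm */
          unsigned int max_nr_cid;        /* highest cid count handed out */
  };

  /* Upper bound on cid expansion: min(mm->nr_cpus_allowed, mm->mm_users). */
  static unsigned int cid_bound(const struct mm_model *mm)
  {
          return mm->nr_cpus_allowed < mm->mm_users ?
                 mm->nr_cpus_allowed : mm->mm_users;
  }

  static int first_zero_bit(unsigned long long mask, unsigned int limit)
  {
          for (unsigned int i = 0; i < limit && i < MAX_CIDS; i++)
                  if (!(mask & (1ULL << i)))
                          return i;
          return -1;
  }

  /* Allocate a cid, preferring @recent_cid (-1 if none) for locality. */
  static int cid_try_get(struct mm_model *mm, int recent_cid)
  {
          unsigned int bound = cid_bound(mm);
          int cid;

          /* 1) Re-use the recently used cid if it is free and in bound. */
          if (recent_cid >= 0 && (unsigned int)recent_cid < bound &&
              !(mm->cid_bitmap & (1ULL << recent_cid))) {
                  mm->cid_bitmap |= 1ULL << recent_cid;
                  return recent_cid;
          }

          /* 2) Expand the allocation while it stays below the bound. */
          while (mm->max_nr_cid < bound) {
                  cid = mm->max_nr_cid++;
                  if (!(mm->cid_bitmap & (1ULL << cid))) {
                          mm->cid_bitmap |= 1ULL << cid;
                          return cid;
                  }
          }

          /* 3) Fall back to the first free cid within [0, bound). */
          cid = first_zero_bit(mm->cid_bitmap, bound);
          if (cid >= 0)
                  mm->cid_bitmap |= 1ULL << cid;
          return cid;
  }

  int main(void)
  {
          struct mm_model mm = { .nr_cpus_allowed = 4, .mm_users = 8 };
          int cid = cid_try_get(&mm, -1);         /* no hint: expands to 0 */

          mm.cid_bitmap &= ~(1ULL << cid);        /* lazy release (~100ms) */
          printf("first cid: %d, re-allocated with hint: %d\n",
                 cid, cid_try_get(&mm, cid));     /* hint returns cid 0 */
          return 0;
  }

The real __mm_cid_try_get() additionally has to cope with concurrent
remote clearing of per-mm/cpu cids and, as described above, only
searches within [ 0, mm->nr_cpus_allowed ].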

* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. The threads run
one after the other (no two threads run concurrently). The order of
thread execution within the sequence is random, and the sequence
begins again after all threads have executed. The 16kB areas are
allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.
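
As an illustration of the mm_cid-based indexing, here is a minimal
userspace sketch (not the benchmark program itself, and not using
rseq_mempool) that indexes a 16kB per-cid area by the current thread's
concurrency ID. It assumes glibc >= 2.35 (__rseq_offset and __rseq_size
in <sys/rseq.h>), kernel headers recent enough for struct rseq to expose
the mm_cid field (Linux >= 6.3), and a compiler providing
__builtin_thread_pointer(); AREA_SIZE and NR_AREAS are illustrative
constants.

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/rseq.h>

  #define AREA_SIZE 16384         /* 16kB of 8-bit counters per cid */
  #define NR_AREAS  512           /* illustrative upper bound on cids */

  static inline struct rseq *rseq_area(void)
  {
          /* glibc registers the rseq area at thread pointer + __rseq_offset. */
          return (struct rseq *)((uintptr_t)__builtin_thread_pointer() +
                                 __rseq_offset);
  }

  static void bump_counters(uint8_t *areas)
  {
          /* Index the 16kB area by the current thread's concurrency ID. */
          uint32_t cid = rseq_area()->mm_cid;
          uint8_t *area = areas + (size_t)cid * AREA_SIZE;

          for (size_t i = 0; i < AREA_SIZE; i++)
                  area[i]++;
  }

  int main(void)
  {
          uint8_t *areas;

          if (!__rseq_size)       /* rseq area not registered by glibc */
                  return 1;
          areas = calloc(NR_AREAS, AREA_SIZE);
          if (!areas)
                  return 1;
          bump_counters(areas);
          printf("current mm_cid: %u\n", rseq_area()->mm_cid);
          free(areas);
          return 0;
  }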

Testing configurations:

8-core/1-L3:        Use 8 cores within a single L3
24-core/24-L3:      Use 24 cores, 1 core per L3
192-core/24-L3:     Use 192 cores (all cores in the system)
384-thread/24-L3:   Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

CPU(s):                   384
  On-line CPU(s) list:    0-383
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 9654 96-Core Processor
    Thread(s) per core:   2
    Core(s) per socket:   96
    Socket(s):            2
Caches (sum of all):
  L1d:                    6 MiB (192 instances)
  L1i:                    6 MiB (192 instances)
  L2:                     192 MiB (192 instances)
  L3:                     768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (mm_cid) / (cache-local mm_cid).
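For example, in the 8-core/1-L3 row of the 200ms table below, the
speedup is 19289 ns / 1336 ns ~= 14.4x.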

Intermittent workload delay: 200ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                         (ns)      (ns)                  (ns)
8-core/1-L3             1374      19289                  1336            14.4x
24-core/24-L3           2423      26721                  1594            16.7x
192-core/24-L3          2291      15826                  2153             7.3x
384-thread/24-L3        1874      13234                  1907             6.9x

Intermittent workload delay: 10ms

                     per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
                         (ns)      (ns)                  (ns)
8-core/1-L3               662       756                   686             1.1x
24-core/24-L3            1378      3648                  1035             3.5x
192-core/24-L3           1439     10833                  1482             7.3x
384-thread/24-L3         1503     10570                  1556             6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
  patch series with a simpler and more general approach. ]

[ This patch applies on top of v6.12-rc1. ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/