linux/kernel
Tejun Heo 5797b1c189 workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.

In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.

However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.

While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.

636b927eba ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.

Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:

- One max_active enforcement decouples from pool boundaires, chaining
  execution after a work item finishes requires inter-pool operations which
  would require lock dancing, which is nasty.

- Sharing a single nr_active count across the whole system can be pretty
  expensive on NUMA machines.

- Per-pwq enforcement had been more or less okay while we were using
  per-node pools.

It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:

- To avoid sharing a single counter across multiple nodes, the configured
  max_active is split across nodes according to the proportion of each
  workqueue's online effective CPUs per node. e.g. A node with twice more
  online effective CPUs will get twice higher portion of max_active.

- Workqueue used to be able to process a chain of interdependent work items
  which is as long as max_active. We can't do this anymore as max_active is
  distributed across the nodes. Instead, a new parameter min_active is
  introduced which determines the minimum level of concurrency within a node
  regardless of how max_active distribution comes out to be.

  It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
  This can lead to higher effective max_weight than configured and also
  deadlocks if a workqueue was depending on being able to handle chains of
  interdependent work items that are longer than 8.

  I believe these should be fine given that the number of CPUs in each NUMA
  node is usually higher than 8 and work item chain longer than 8 is pretty
  unlikely. However, if these assumptions turn out to be wrong, we'll need
  to add an interface to adjust min_active.

- Each unbound wq has an array of struct wq_node_nr_active which tracks
  per-node nr_active. When its pwq wants to run a work item, it has to
  obtain the matching node's nr_active. If over the node's max_active, the
  pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
  the completion path round-robins the pending pwqs activating the first
  inactive work item of each, which involves some pool lock dancing and
  kicking other pools. It's not the simplest code but doesn't look too bad.

v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().

    - wq_adjust_max_active() is now protected by wq->mutex instead of
      wq_pool_mutex.

v3: - wq_node_max_active() used to calculate per-node max_active on the fly
      based on system-wide CPU online states. Lai pointed out that this can
      lead to skewed distributions for workqueues with restricted cpumasks.
      Update the max_active distribution to use per-workqueue effective
      online CPU counts instead of system-wide and cache the calculation
      results in node_nr_active->max.

v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 08:11:25 -10:00
..
bpf Networking changes for 6.8. 2024-01-11 10:07:29 -08:00
cgroup cgroup: Changes for v6.8 2024-01-08 20:04:02 -08:00
configs hardening: Provide Kconfig fragments for basic options 2023-09-22 09:50:55 -07:00
debug kdb: Corrects comment for kdballocenv 2023-11-06 17:13:55 +00:00
dma dma-mapping updates for Linux 6.8 2024-01-11 13:46:50 -08:00
entry entry: Move syscall_enter_from_user_mode() to header file 2023-12-21 23:12:18 +01:00
events Many singleton patches against the MM code. The patch series which 2024-01-09 11:18:47 -08:00
futex plist: Split out plist_types.h 2023-12-20 19:26:31 -05:00
gcov gcov: annotate struct gcov_iterator with __counted_by 2023-10-18 14:43:22 -07:00
irq As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
kcsan mm: delete checks for xor_unlock_is_negative_byte() 2023-10-18 14:34:17 -07:00
livepatch livepatch: Fix missing newline character in klp_resolve_symbols() 2023-09-20 11:24:18 +02:00
locking RCU pull request for v6.8 2024-01-12 16:35:58 -08:00
module Modules changes for v6.8-rc1 2024-01-10 18:00:18 -08:00
power PM: hibernate: Repair excess function parameter description warning 2023-12-20 19:19:26 +01:00
printk TTY/Serial changes for 6.7-rc1 2023-11-03 15:44:25 -10:00
rcu Merge branches 'doc.2023.12.13a', 'torture.2023.11.23a', 'fixes.2023.12.13a', 'rcu-tasks.2023.12.12b' and 'srcu.2023.12.13a' into rcu-merge.2023.12.13a 2023-12-14 01:21:31 +05:30
sched header cleanups for 6.8 2024-01-10 16:43:55 -08:00
time Timer subsystem changes for v6.8: 2024-01-08 18:44:11 -08:00
trace Networking changes for 6.8. 2024-01-11 10:07:29 -08:00
.gitignore
acct.c fs: rename __mnt_{want,drop}_write*() helpers 2023-09-11 15:05:50 +02:00
async.c header cleanups for 6.8 2024-01-10 16:43:55 -08:00
audit_fsnotify.c
audit_tree.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
audit_watch.c audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare() 2023-11-14 17:34:27 -05:00
audit.c audit: Send netlink ACK before setting connection in auditd_set 2023-11-12 22:33:49 -05:00
audit.h audit: correct audit_filter_inodes() definition 2023-07-21 12:17:25 -04:00
auditfilter.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
auditsc.c audit,io_uring: io_uring openat triggers audit reference count underflow 2023-10-13 18:34:46 +02:00
backtracetest.c
bounds.c mm: multi-gen LRU: minimal implementation 2022-09-26 19:46:09 -07:00
capability.c lsm: constify the 'target' parameter in security_capget() 2023-08-08 16:48:47 -04:00
cfi.c cfi: Switch to -fsanitize=kcfi 2022-09-26 10:13:13 -07:00
compat.c sched_getaffinity: don't assume 'cpumask_size()' is fully initialized 2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c locking/atomic: treewide: use raw_atomic*_<op>() 2023-06-05 09:57:20 +02:00
cpu_pm.c cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}() 2023-01-13 11:48:15 +01:00
cpu.c slab updates for 6.8 2024-01-09 10:36:07 -08:00
crash_core.c Quite a lot of kexec work this time around. Many singleton patches in 2024-01-09 11:46:20 -08:00
crash_dump.c
cred.c cred: get rid of CONFIG_DEBUG_CREDENTIALS 2023-12-15 14:19:48 -08:00
delayacct.c delayacct: track delays from IRQ/SOFTIRQ 2023-04-18 16:39:34 -07:00
dma.c
exec_domain.c
exit.c header cleanups for 6.8 2024-01-10 16:43:55 -08:00
exit.h exit: add internal include file with helpers 2023-09-21 12:03:50 -06:00
extable.c
fail_function.c kernel/fail_function: fix memory leak with using debugfs_lookup() 2023-02-08 13:36:22 +01:00
fork.c header cleanups for 6.8 2024-01-10 16:43:55 -08:00
freezer.c Linux 6.7-rc6 2023-12-23 15:52:13 +01:00
gen_kheaders.sh Revert "kheaders: substituting --sort in archive creation" 2023-05-28 16:20:21 +09:00
groups.c groups: Convert group_info.usage to refcount_t 2023-09-29 11:28:39 -07:00
hung_task.c kernel/hung_task.c: set some hung_task.c variables storage-class-specifier to static 2023-04-08 13:45:37 -07:00
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c trace: Add trace_ipi_send_cpu() 2023-03-24 11:01:29 +01:00
jump_label.c jump_label: Prevent key->enabled int overflow 2022-12-01 15:53:05 -08:00
kallsyms_internal.h kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[] 2022-11-12 18:47:36 -08:00
kallsyms_selftest.c Modules changes for v6.6-rc1 2023-08-29 17:32:32 -07:00
kallsyms_selftest.h kallsyms: Add self-test facility 2022-11-15 00:42:02 -08:00
kallsyms.c kallsyms: Change func signature for cleanup_symbol_name() 2023-08-25 15:00:36 -07:00
kcmp.c file: convert to SLAB_TYPESAFE_BY_RCU 2023-10-19 11:02:48 +02:00
Kconfig.freezer
Kconfig.hz
Kconfig.kexec kexec: select CRYPTO from KEXEC_FILE instead of depending on it 2023-12-20 13:46:19 -08:00
Kconfig.locks
Kconfig.preempt
kcov.c kcov: add prototypes for helper functions 2023-06-09 17:44:17 -07:00
kexec_core.c kexec_core: fix the assignment to kimage->control_page 2023-12-29 12:22:29 -08:00
kexec_elf.c
kexec_file.c kexec_file: fix incorrect temp_start value in locate_mem_hole_top_down() 2023-12-29 12:22:25 -08:00
kexec_internal.h panic, kexec: make __crash_kexec() NMI safe 2022-09-11 21:55:06 -07:00
kexec.c kernel: kexec: copy user-array safely 2023-10-09 16:59:47 +10:00
kheaders.c kheaders: Use array declaration instead of char 2023-03-24 20:10:59 -07:00
kprobes.c kprobes: consistent rcu api usage for kretprobe holder 2023-12-01 14:53:55 +09:00
ksyms_common.c kallsyms: make kallsyms_show_value() as generic function 2023-06-08 12:27:20 -07:00
ksysfs.c crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
kthread.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
latencytop.c latencytop: use the last element of latency_record of system 2022-09-11 21:55:12 -07:00
Makefile kernel/numa.c: Move logging out of numa.h 2023-12-20 19:26:30 -05:00
module_signature.c
notifier.c notifiers: add tracepoints to the notifiers infrastructure 2023-04-08 13:45:38 -07:00
nsproxy.c nsproxy: Convert nsproxy.count to refcount_t 2023-08-21 11:29:12 -07:00
numa.c kernel/numa.c: Move logging out of numa.h 2023-12-20 19:26:30 -05:00
padata.c padata: Fix refcnt handling in padata_free_shell() 2023-10-27 18:04:24 +08:00
panic.c panic: use atomic_try_cmpxchg in panic() and nmi_panic() 2023-10-04 10:41:56 -07:00
params.c params: Fix multi-line comment style 2023-12-01 09:51:44 -08:00
pid_namespace.c wait: Remove uapi header file from main header file 2023-12-20 19:26:31 -05:00
pid_sysctl.h memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid.c file: remove __receive_fd() 2023-12-12 14:24:14 +01:00
profile.c kernel/profile.c: simplify duplicated code in profile_setup() 2022-09-11 21:55:12 -07:00
ptrace.c Quite a lot of kexec work this time around. Many singleton patches in 2024-01-09 11:46:20 -08:00
range.c
reboot.c Thermal control updates for 6.8-rc1 2024-01-09 16:20:17 -08:00
regset.c
relay.c kernel: relay: remove relay_file_splice_read dead code, doesn't work 2023-12-29 12:22:27 -08:00
resource_kunit.c
resource.c Quite a lot of kexec work this time around. Many singleton patches in 2024-01-09 11:46:20 -08:00
rseq.c rseq: Extend struct rseq with per-memory-map concurrency ID 2022-12-27 12:52:12 +01:00
scftorture.c scftorture: Pause testing after memory-allocation failure 2023-07-14 15:02:57 -07:00
scs.c scs: add support for dynamic shadow call stacks 2022-11-09 18:06:35 +00:00
seccomp.c file: remove __receive_fd() 2023-12-12 14:24:14 +01:00
signal.c kernel/signal.c: simplify force_sig_info_to_task(), kill recalc_sigpending_and_wake() 2023-12-10 17:21:32 -08:00
smp.c CSD lock commits for v6.7 2023-10-30 17:56:53 -10:00
smpboot.c kthread: add kthread_stop_put 2023-10-04 10:41:57 -07:00
smpboot.h
softirq.c sched/core: introduce sched_core_idle_cpu() 2023-07-13 15:21:50 +02:00
stackleak.c stackleak: allow to specify arch specific stackleak poison function 2023-04-20 11:36:35 +02:00
stacktrace.c stacktrace: fix kernel-doc typo 2023-12-29 12:22:29 -08:00
static_call_inline.c static_call: Add call depth tracking support 2022-10-17 16:41:16 +02:00
static_call.c
stop_machine.c
sys_ni.c lsm/stable-6.8 PR 20240105 2024-01-09 12:57:46 -08:00
sys.c prctl: Disable prctl(PR_SET_MDWE) on parisc 2023-11-18 19:35:31 +01:00
sysctl-test.c kernel/sysctl-test: use SYSCTL_{ZERO/ONE_HUNDRED} instead of i_{zero/one_hundred} 2022-09-08 16:56:45 -07:00
sysctl.c asm-generic updates for v6.7 2023-11-01 15:28:33 -10:00
task_work.c task_work: add kerneldoc annotation for 'data' argument 2023-09-19 13:21:32 -07:00
taskstats.c taskstats: fill_stats_for_tgid: use for_each_thread() 2023-10-04 10:41:57 -07:00
torture.c torture: Print out torture module parameters 2023-09-24 17:24:01 +02:00
tracepoint.c tracepoint: Allow livepatch module add trace event 2023-02-18 14:34:36 -05:00
tsacct.c
ucount.c sysctl: Add size to register_sysctl 2023-08-15 15:26:17 -07:00
uid16.c
uid16.h
umh.c sysctl: fix unused proc_cap_handler() function warning 2023-06-29 15:19:43 -07:00
up.c smp: Change function signatures to use call_single_data_t 2023-09-13 14:59:24 +02:00
user_namespace.c mnt_idmapping: decouple from namespaces 2023-11-28 14:08:47 +01:00
user-return-notifier.c
user.c binfmt_misc: enable sandboxed mounts 2023-10-11 08:46:01 -07:00
usermode_driver.c
utsname_sysctl.c utsname: simplify one-level sysctl registration for uts_kern_table 2023-04-13 11:49:35 -07:00
utsname.c
vhost_task.c vhost: Fix worker hangs due to missed wake up calls 2023-06-08 15:43:09 -04:00
watch_queue.c watch_queue: fix kcalloc() arguments order 2023-12-21 13:17:54 +01:00
watchdog_buddy.c watchdog/hardlockup: move SMP barriers from common code to buddy code 2023-06-19 16:25:28 -07:00
watchdog_perf.c watchdog/perf: add a weak function for an arch to detect if perf can use NMIs 2023-06-09 17:44:21 -07:00
watchdog.c watchdog: if panicking and we dumped everything, don't re-enable dumping 2023-12-29 12:22:30 -08:00
workqueue_internal.h workqueue: Drop the special locking rule for worker->flags and worker_pool->flags 2023-08-07 15:57:22 -10:00
workqueue.c workqueue: Implement system-wide nr_active enforcement for unbound workqueues 2024-01-29 08:11:25 -10:00