linux-stable/Documentation/core-api
Tejun Heo 8639ecebc9 workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.
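
As an illustration only, the pod grouping can be modelled in a few lines of
Python. This is a toy sketch, not kernel code: group_into_pods() and the
l3_of table are hypothetical names, and the shares_cache callback merely
stands in for cpus_share_cache().

```python
def group_into_pods(cpus, shares_cache):
    """Partition CPUs into pods.

    Two CPUs land in the same pod when shares_cache(a, b) is true.
    shares_cache is a stand-in for the kernel's cpus_share_cache().
    """
    pods = []
    for cpu in cpus:
        for pod in pods:
            # Compare against the first member; cache sharing is assumed
            # to be an equivalence relation in this toy model.
            if shares_cache(cpu, pod[0]):
                pod.append(cpu)
                break
        else:
            pods.append([cpu])
    return pods


# Hypothetical 8-CPU machine: CPUs 0-3 share one L3, CPUs 4-7 the other.
l3_of = {cpu: cpu // 4 for cpu in range(8)}
pods = group_into_pods(range(8), lambda a, b: l3_of[a] == l3_of[b])
print(pods)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With two L3 domains the model yields two pods, matching the two worker_pools
in the example above.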

While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
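
The cap can be made concrete with a small Python model. It is a sketch under
the assumption of the same hypothetical 8-CPU, two-pod layout as the example;
strict_pod_cpus() is an illustrative name and not a kernel function.

```python
def strict_pod_cpus(issuer_cpu, pods):
    """Return the CPUs reachable under strict pod boundaries.

    Work items are served by the pod that contains the issuing CPU, so
    that pod is all the issuer can ever use.
    """
    for pod in pods:
        if issuer_cpu in pod:
            return set(pod)
    raise ValueError(f"CPU {issuer_cpu} is not in any pod")


# Hypothetical layout: two pods of four CPUs each.
pods = [[0, 1, 2, 3], [4, 5, 6, 7]]
reachable = strict_pod_cpus(0, pods)
total_cpus = sum(len(pod) for pod in pods)
print(len(reachable) / total_cpus)  # 0.5
```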

While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.

This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside. That is, work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.

After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:

* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
  ->__pod_cpumask so that the workers are allowed to run on any CPU that
  the associated workqueues allow.

* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
  the field to a CPU within the pod.
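
The two rules above can be sketched as a toy Python model. It is not the
kernel implementation: cpumasks are plain sets, pick_wake_cpu() is a
hypothetical helper standing in for the relevant part of kick_pool(), and the
choice of which in-pod CPU to use (min here) is an arbitrary placeholder.

```python
def pool_allowed_cpus(attrs_cpumask, pod_cpumask, affn_strict):
    """CPUs a pool's workers may run on.

    Strict: only the pod. Non-strict: everything the workqueue allows.
    """
    return set(pod_cpumask) if affn_strict else set(attrs_cpumask)


def pick_wake_cpu(wake_cpu, pod_cpumask, affn_strict, choose=min):
    """CPU an idle worker should be woken on.

    In the non-strict case a worker that drifted outside the pod is
    steered back to a CPU inside it; otherwise wake_cpu is left alone.
    'choose' is a placeholder policy for picking the in-pod CPU.
    """
    if affn_strict or wake_cpu in pod_cpumask:
        return wake_cpu
    return choose(pod_cpumask)


pod = {0, 1, 2, 3}          # hypothetical pod
allowed = set(range(8))     # workqueue allowed on all 8 CPUs

print(sorted(pool_allowed_cpus(allowed, pod, affn_strict=False)))
print(pick_wake_cpu(6, pod, affn_strict=False))  # steered back into the pod
print(pick_wake_cpu(2, pod, affn_strict=False))  # already inside, unchanged
```

Note that only the wake-up target is adjusted; the allowed mask stays wide, so
the scheduler remains free to migrate the worker afterwards.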

This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.

There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.

While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
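
For the sysfs side, a minimal Python sketch follows. It rests on assumptions
that this message does not state: that the workqueue is exported via sysfs,
that its directory lives under /sys/devices/virtual/workqueue/, and that the
knob is a file named affinity_strict taking "0" or "1". The helper names are
hypothetical, and writing to the real file requires root.

```python
from pathlib import Path

# Assumed location of per-workqueue sysfs directories.
DEFAULT_SYSFS_ROOT = "/sys/devices/virtual/workqueue"


def affinity_strict_path(wq_name, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Build the path of the (assumed) affinity_strict attribute."""
    return Path(sysfs_root) / wq_name / "affinity_strict"


def set_affinity_strict(wq_name, strict, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Write "1" for strict or "0" for non-strict; return the path used."""
    path = affinity_strict_path(wq_name, sysfs_root)
    path.write_text("1" if strict else "0")
    return path


def get_affinity_strict(wq_name, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Read the attribute back as a bool."""
    return affinity_strict_path(wq_name, sysfs_root).read_text().strip() == "1"
```

Passing a different sysfs_root makes the helpers usable against a scratch
directory, which is convenient for trying them out without touching a live
system.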

v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-07 15:57:25 -10:00
irq Documentation: irqdomain: Fix typo of "at least once" 2022-08-18 11:11:52 -06:00
wrappers docs: put atomic*.txt and memory-barriers.txt into the core-api book 2022-09-29 12:55:06 -06:00
asm-annotations.rst docs: move x86 documentation into Documentation/arch/ 2023-03-30 12:58:51 -06:00
assoc_array.rst Documentation: Use "while" instead of "whilst" 2018-11-20 09:30:43 -07:00
boot-time-mm.rst docs/boot-time-mm: remove bootmem documentation 2018-10-31 08:54:16 -07:00
cachetlb.rst mm: Add flush_dcache_folio() 2021-10-18 07:49:36 -04:00
circular-buffers.rst doc: Remove ".vnet" from paulmck email addresses 2019-05-28 09:02:57 -07:00
cpu_hotplug.rst x86/topology: Remove CPU0 hotplug option 2023-05-15 13:44:49 +02:00
debug-objects.rst doc: debugobjects: actually pull in the kerneldoc comments 2016-11-29 14:44:14 -07:00
debugging-via-ohci1394.rst docs: debugging-via-ohci1394.txt: add it to the core-api book 2020-05-15 11:59:17 -06:00
dma-api-howto.rst dma-api-howto: typo fix 2023-04-10 16:46:11 -06:00
dma-api.rst docs/mm: Physical Memory: remove useless markup 2023-02-02 10:18:04 -07:00
dma-attributes.rst Reinstate some of "swiotlb: rework "fix info leak with DMA_FROM_DEVICE"" 2022-03-28 11:37:05 -07:00
dma-isa-lpc.rst docs: core-api: avoid using ReST :doc:foo markup 2021-06-17 13:24:37 -06:00
entry.rst Documentation: core-api: entry: Add comments about nesting 2022-01-27 11:32:40 -07:00
errseq.rst errseq: Add to documentation tree 2018-01-01 12:40:27 -07:00
genalloc.rst lib/genalloc.c: rename addr_in_gen_pool to gen_pool_has_addr 2019-12-04 19:44:13 -08:00
generic-radix-tree.rst generic radix trees 2019-03-12 10:04:02 -07:00
genericirq.rst docs: genericirq.rst: don't document chip.c functions twice 2020-10-15 07:49:41 +02:00
gfp_mask-from-fs-io.rst docs: core-api/gfp_mask-from-fs-io: add a label for cross-referencing 2018-09-20 11:02:32 -06:00
idr.rst IDR: Note that the IDR API is deprecated 2022-07-10 21:17:30 -04:00
index.rst docs: add more netlink docs (incl. spec docs) 2023-01-24 10:58:11 +01:00
kernel-api.rst It's been a relatively calm cycle in docsland. We do have: 2023-06-27 11:33:47 -07:00
kobject.rst kobject documentation: remove default_attrs information 2022-01-07 11:23:37 +01:00
kref.rst docs: move the kref doc into the core-api book 2020-05-15 12:02:19 -06:00
librs.rst docs-rst: convert librs book to ReST 2017-05-16 08:44:16 -03:00
local_ops.rst timers: Update the documentation to reflect on the new timer_shutdown() API 2022-11-24 15:09:12 +01:00
maple_tree.rst Maple Tree: add new data structure 2022-09-26 19:46:13 -07:00
memory-allocation.rst mm/slab: document kfree() as allowed for kmem_cache_alloc() objects 2023-03-29 10:35:41 +02:00
memory-hotplug.rst mm/memory_hotplug: remove HIGHMEM leftovers 2021-11-06 13:30:42 -07:00
mm-api.rst mm/page_alloc: remove obsolete gfpflags_normal_context() 2022-10-03 14:03:30 -07:00
netlink.rst docs: add more netlink docs (incl. spec docs) 2023-01-24 10:58:11 +01:00
packing.rst Documentation: core-api: packing: correct spelling 2023-02-15 21:40:54 -08:00
padata.rst Documentation: core-api: padata: correct spelling 2023-02-16 16:58:01 -07:00
pin_user_pages.rst mm: Don't pin ZERO_PAGE in pin_user_pages() 2023-05-31 09:48:15 -06:00
printk-basics.rst printk: Move the printk() kerneldoc comment to its new home 2021-07-26 12:36:44 +02:00
printk-formats.rst mm, printk: introduce new format %pGt for page_type 2023-03-28 16:20:09 -07:00
printk-index.rst printk/index: Printk index feature documentation 2022-04-13 14:25:31 +02:00
protection-keys.rst Documentation/protection-keys: Clean up documentation for User Space pkeys 2022-06-07 16:06:22 -07:00
rbtree.rst docs: rbtree.rst: Fix a typo 2021-03-25 11:38:51 -06:00
refcount-vs-atomic.rst docs: remove :c:func: from refcount-vs-atomic.rst 2019-10-07 09:08:56 -06:00
symbol-namespaces.rst doc: module: update file references 2022-07-01 14:50:01 -07:00
this_cpu_ops.rst arch: Remove cmpxchg_double 2023-06-05 09:36:39 +02:00
timekeeping.rst timekeeping: Introduce fast accessor to clock tai 2022-04-14 16:19:30 +02:00
tracepoint.rst doc: Sphinxify the tracepoint docbook 2016-11-29 14:44:23 -07:00
unaligned-memory-access.rst docs: move other kAPI documents to core-api 2020-06-26 11:33:42 -06:00
watch_queue.rst Documentation: move watch_queue to core-api 2022-04-22 09:47:25 -06:00
workqueue.rst workqueue: Implement non-strict affinity scope for unbound workqueues 2023-08-07 15:57:25 -10:00
xarray.rst XArray: Document the locking requirement for the xa_state 2022-02-03 15:56:50 -05:00