The kernel sleep profile is no longer working due to a recursive locking
bug introduced by commit 42a20f86dc ("sched: Add wrapper for get_wchan()
to keep task blocked")
Booting with the 'profile=sleep' kernel command line option added or
executing
# echo -n sleep > /sys/kernel/profiling
after boot causes the system to lock up.
Lockdep reports
kthreadd/3 is trying to acquire lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: get_wchan+0x32/0x70
but task is already holding lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: try_to_wake_up+0x53/0x370
with the call trace being
lock_acquire+0xc8/0x2f0
get_wchan+0x32/0x70
__update_stats_enqueue_sleeper+0x151/0x430
enqueue_entity+0x4b0/0x520
enqueue_task_fair+0x92/0x6b0
ttwu_do_activate+0x73/0x140
try_to_wake_up+0x213/0x370
swake_up_locked+0x20/0x50
complete+0x2f/0x40
kthread+0xfb/0x180
However, since nobody noticed this regression for more than two years,
let's remove 'profile=sleep' support based on the assumption that nobody
needs this functionality.
Fixes: 42a20f86dc ("sched: Add wrapper for get_wchan() to keep task blocked")
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I got the following warn report while doing stress test:
jump label: negative count!
WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
Call Trace:
<TASK>
__static_key_slow_dec_cpuslocked+0x16/0x70
sched_cpu_deactivate+0x26e/0x2a0
cpuhp_invoke_callback+0x3ad/0x10d0
cpuhp_thread_fun+0x3f5/0x680
smpboot_thread_fn+0x56d/0x8d0
kthread+0x309/0x400
ret_from_fork+0x41/0x70
ret_from_fork_asm+0x1b/0x30
</TASK>
Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(),
the cpu offline failed, but sched_smt_present is decremented before
calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so
fix it by incrementing sched_smt_present in the error path.
Fixes: c5511d03ec ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns
In cputime_adjust() process, stime is greater than rtime due to
mul_u64_u64_div_u64() precision problem.
before call mul_u64_u64_div_u64(),
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after call mul_u64_u64_div_u64(),
stime = 135989949653530
unsigned reversion occurs because rtime is less than stime.
utime = rtime - stime = 135989749728000 - 135989949653530
= -199925530
= (u64)18446744073709518790
Trigger condition:
1). User task run in kernel mode most of time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
Fixes: 3dc167ba57 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
- Update Daniel Bristot de Oliveira's entry in MAINTAINERS,
and credit him in CREDITS.
- Harmonize the lock-yielding behavior on dynamically selected
preemption models with static ones.
- Reorganize the code a bit: split out sched/syscalls.c to reduce
the size of sched/core.c
- Micro-optimize psi_group_change()
- Fix set_load_weight() for SCHED_IDLE tasks
- Misc cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmaVtVARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iqTQ/9GLNzNBnl0oBWCiybeQjyWsZ6BiZi48R0
C1g9/RKy++OyGOjn/yqYK0Kg8cdfoGzHGioMMAucHFW1nXZwVw17xAJK127N0apF
83up7AnFJw/JGr1bI0FwuozqHAs4Z5KzHTv2KBxhYuO77lyYna6/t0liRUbF8ZUZ
I/nqav7wDB8RBIB5hEJ/uYLDX7qWdUlyFB+mcvV4ANA99yr++OgipCp6Ob3Rz3cP
O676nKJY4vpNbZ/B6bpKg8ezULRP8re2qD3GJRf2huS63uu/Z5ct7ouLVZ1DwN53
mFDBTYUMI2ToV0pseikuqwnmrjxAKcEajTyZpD3vckafd2TlWIopkQZoQ9XLLlIZ
DxO+KoekaHTSVy8FWlO8O+iE3IAdUUgECEpNveX45Pb7nFP+5dtFqqnVIdNqCq5e
zEuQvizaa5m+A1POZhZKya+z9jbLXXx+gtPCbbADTBWtuyl8azUIh3vjn0bykmv4
IVV/wvUm+BPEIhnKusZZOgB0vLtxUdntBBfUSxqoSOad9L+0/UtSKoKI6wvW00q8
ZkW+85yS3YFiN9W61276RLis2j7OAjE0eDJ96wfhooma2JRDJU4Wmg5oWg8x3WuA
JRmK0s63Qik5gpwG5rHQsR5jNqYWTj5Lp7So+M1kRfFsOM/RXQ/AneSXZu/P7d65
LnYWzbKu76c=
=lLab
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Update Daniel Bristot de Oliveira's entry in MAINTAINERS,
and credit him in CREDITS
- Harmonize the lock-yielding behavior on dynamically selected
preemption models with static ones
- Reorganize the code a bit: split out sched/syscalls.c to reduce
the size of sched/core.c
- Micro-optimize psi_group_change()
- Fix set_load_weight() for SCHED_IDLE tasks
- Misc cleanups & fixes
* tag 'sched-core-2024-07-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Update MAINTAINERS and CREDITS
sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks
sched/psi: Optimise psi_group_change a bit
sched/core: Drop spinlocks on contention iff kernel is preemptible
sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
sched/balance: Skip unnecessary updates to idle load balancer's flags
idle: Remove stale RCU comment
sched/headers: Move struct pre-declarations to the beginning of the header
sched/core: Clean up kernel/sched/sched.h a bit
sched/core: Simplify prefetch_curr_exec_start()
sched: Fix spelling in comments
sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c
doc.2024.06.06a: Update Tasks RCU and Tasks Rude RCU description in
Requirements.rst and clarify rcu_assign_pointer() and
rcu_dereference() ordering properties.
fixes.2024.07.04a: Add lockdep assertions for RCU readers, limit inline
wakeups for callback-bypass synchronize_rcu(), add an
rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter,
add Uladzislau Rezki as RCU maintainer, and fix a subtle
callback-migration memory-ordering issue.
mb.2024.06.28a: Remove a number of redundant memory barriers.
nocb.2024.06.03a: Remove unnecessary bypass-list lock-contention
mitigation, use parking API instead of open-coded ad-hoc
equivalent, and upgrade obsolete comments.
rcu-tasks.2024.06.06a: Revert avoidance of a deadlock that can no
longer occur and properly synchronize Tasks Trace RCU checking
of runqueues.
rcutorture.2024.06.06a: Add tests for handling of double-call_rcu()
bug, add missing MODULE_DESCRIPTION, and add a script that
histograms the number of calls to RCU updaters.
srcu.2024.06.18a: Fill out SRCU polled-grace-period API.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmaR7/QTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jGwAEACJKef2LryG6khoJdorWbvRf1V2k23H
19CxXexCE4UoGsgGST9z1/5rM8kBdNhdhQ0JB9CitW+zGlXpOM79/mO3gALKMj++
YBPw9B5EM622H2cKJGFzoHFSO4X9nM1CCMeuFCo6bVsbWfMtX3ENqsYl2IQy1JkB
pGiKqcNXGWU0mdUcZKs/8ilfLG1NhaLwrkfinlsP9V1+8z8LxxDH5Qh27AT3rIvu
W87OITTZoHlUaDVHYTautHTZoqM381xv9kNoQlS9lpH/gcFOPiO9DLj8NcLjkJ4y
S/OrxOwfQ+BGKwnk8daFQFAc3Nr9KeVAQH7CbOW7guARhj3z97J0+wPm6nZGEE2s
tDzg8zLT9LtbmUypJLurl29+wFE4fPNsnd69XDONbMFN1Ox2tJM3dd/rPCsHSUvz
kEOK9gUreHOv7/Ou6UIHlYVlHY7HHuD7TAsrhaaWk7CEmlY31UKwXG+fMl1FAnSy
F3PcBF/1M687RRFWVeMlug/+0/+ghtc+kZ1YyR79KZR6dI0C7ueQbCBGztCCtFDz
RjrHcDifS0Y2GNQO9+zAyrJvttidRATdYDeFstk+8nnta3CnYzxCp4rn5hs3Ss3N
AJVJm244jR3AcoL4V/tQwiQlYh9ZYN5tZ7qxFiASdtV50Uc8HoIrWXeP0Ar+GHiV
2z/f5fKF4+5clQ==
=7a1C
-----END PGP SIGNATURE-----
Merge tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Update Tasks RCU and Tasks Rude RCU description in Requirements.rst
and clarify rcu_assign_pointer() and rcu_dereference() ordering
properties
- Add lockdep assertions for RCU readers, limit inline wakeups for
callback-bypass synchronize_rcu(), add an
rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter, add
Uladzislau Rezki as RCU maintainer, and fix a subtle
callback-migration memory-ordering issue
- Remove a number of redundant memory barriers
- Remove unnecessary bypass-list lock-contention mitigation, use
parking API instead of open-coded ad-hoc equivalent, and upgrade
obsolete comments
- Revert avoidance of a deadlock that can no longer occur and properly
synchronize Tasks Trace RCU checking of runqueues
- Add tests for handling of double-call_rcu() bug, add missing
MODULE_DESCRIPTION, and add a script that histograms the number of
calls to RCU updaters
- Fill out SRCU polled-grace-period API
* tag 'rcu.2024.07.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits)
rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation
rcu: Eliminate lockless accesses to rcu_sync->gp_count
MAINTAINERS: Add Uladzislau Rezki as RCU maintainer
rcu: Add rcutree.nohz_full_patience_delay to reduce nohz_full OS jitter
rcu/exp: Remove redundant full memory barrier at the end of GP
rcu: Remove full memory barrier on RCU stall printout
rcu: Remove full memory barrier on boot time eqs sanity check
rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot
rcu: Remove superfluous full memory barrier upon first EQS snapshot
rcu: Remove full ordering on second EQS snapshot
srcu: Fill out polled grace-period APIs
srcu: Update cleanup_srcu_struct() comment
srcu: Add NUM_ACTIVE_SRCU_POLL_OLDSTATE
srcu: Disable interrupts directly in srcu_gp_end()
rcu: Disable interrupts directly in rcu_gp_init()
rcu/tree: Reduce wake up for synchronize_rcu() common case
rcu/tasks: Fix stale task snaphot for Tasks Trace
tools/rcu: Add rcu-updaters.sh script
rcutorture: Add missing MODULE_DESCRIPTION() macros
rcutorture: Fix rcu_torture_fwd_cb_cr() data race
...
When a task's weight is being changed, set_load_weight() is called with
@update_load set. As weight changes aren't trivial for the fair class,
set_load_weight() calls fair.c::reweight_task() for fair class tasks.
However, set_load_weight() first tests task_has_idle_policy() on entry and
skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as
SCHED_IDLE tasks are just fair tasks with a very low weight and they would
incorrectly skip load, vlag and position updates.
Fix it by updating reweight_task() to take struct load_weight as idle weight
can't be expressed with prio and making set_load_weight() call
reweight_task() for SCHED_IDLE tasks too when @update_load is set.
Fixes: 9059393e4e ("sched/fair: Use reweight_entity() for set_user_nice()")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net
The current code loops over the psi_states only to call a helper which
then resolves back to the action needed for each state using a switch
statement. That is effectively creating a double indirection of a kind
which, given how all the states need to be explicitly listed and handled
anyway, we can simply remove. Both the for loop and the switch statement
that is.
The benefit is both in the code size and CPU time spent in this function.
YMMV but on my Steam Deck, while in a game, the patch makes the CPU usage
go from ~2.4% down to ~1.2%. Text size at the same time went from 0x323 to
0x2c1.
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20240625135000.38652-1-tursulin@igalia.com
It was reported that in moving to 6.1, a larger then 10%
regression was seen in the performance of
clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).
Using a simple reproducer, I found:
5.10:
100000000 calls in 24345994193 ns => 243.460 ns per call
100000000 calls in 24288172050 ns => 242.882 ns per call
100000000 calls in 24289135225 ns => 242.891 ns per call
6.1:
100000000 calls in 28248646742 ns => 282.486 ns per call
100000000 calls in 28227055067 ns => 282.271 ns per call
100000000 calls in 28177471287 ns => 281.775 ns per call
The cause of this was finally narrowed down to the addition of
psi_account_irqtime() in update_rq_clock_task(), in commit
52b1364ba0 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
pressure").
In my initial attempt to resolve this, I leaned towards moving
all accounting work out of the clock_gettime() call path, but it
wasn't very pretty, so it will have to wait for a later deeper
rework. Instead, Peter shared this approach:
Rework psi_account_irqtime() to use its own psi_irq_time base
for accounting, and move it out of the hotpath, calling it
instead from sched_tick() and __schedule().
In testing this, we found the importance of ensuring
psi_account_irqtime() is run under the rq_lock, which Johannes
Weiner helpfully explained, so also add some lockdep annotations
to make that requirement clear.
With this change the performance is back in-line with 5.10:
6.1+fix:
100000000 calls in 24297324597 ns => 242.973 ns per call
100000000 calls in 24318869234 ns => 243.189 ns per call
100000000 calls in 24291564588 ns => 242.916 ns per call
Reported-by: Jimmy Shiu <jimmyshiu@google.com>
Originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com
During the execution of the following stress test with linux-rt:
stress-ng --cyclic 30 --timeout 30 --minimize --quiet
kmemleak frequently reported a memory leak concerning the task_struct:
unreferenced object 0xffff8881305b8000 (size 16136):
comm "stress-ng", pid 614, jiffies 4294883961 (age 286.412s)
object hex dump (first 32 bytes):
02 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .@..............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
debug hex dump (first 16 bytes):
53 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 S...............
backtrace:
[<00000000046b6790>] dup_task_struct+0x30/0x540
[<00000000c5ca0f0b>] copy_process+0x3d9/0x50e0
[<00000000ced59777>] kernel_clone+0xb0/0x770
[<00000000a50befdc>] __do_sys_clone+0xb6/0xf0
[<000000001dbf2008>] do_syscall_64+0x5d/0xf0
[<00000000552900ff>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
The issue occurs in start_dl_timer(), which increments the task_struct
reference count and sets a timer. The timer callback, dl_task_timer,
is supposed to decrement the reference count upon expiration. However,
if enqueue_task_dl() is called before the timer expires and cancels it,
the reference count is not decremented, leading to the leak.
This patch fixes the reference leak by ensuring the task_struct
reference count is properly decremented when the timer is canceled.
Fixes: feff2e65ef ("sched/deadline: Unthrottle PI boosted threads while enqueuing")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240620125618.11419-1-wander@redhat.com
This reverts commit b0defa7ae0.
b0defa7ae0 changed the load balancing logic to ignore env.max_loop if
all tasks examined to that point were pinned. The goal of the patch was
to make it more likely to be able to detach a task buried in a long list
of pinned tasks. However, this has the unfortunate side effect of
creating an O(n) iteration in detach_tasks(), as we now must fully
iterate every task on a cpu if all or most are pinned. Since this load
balance code is done with rq lock held, and often in softirq context, it
is very easy to trigger hard lockups. We observed such hard lockups with
a user who affined O(10k) threads to a single cpu.
When I discussed this with Vincent he initially suggested that we keep
the limit on the number of tasks to detach, but increase the number of
tasks we can search. However, after some back and forth on the mailing
list, he recommended we instead revert the original patch, as it seems
likely no one was actually getting hit by the original issue.
Fixes: b0defa7ae0 ("sched/fair: Make sure to try to detach at least one movable task")
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240620214450.316280-1-joshdon@google.com
When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running
on all online CPUs, no explicit ordering synchronizes properly with a
context switch. This lack of ordering can permit the new task to miss
pre-grace-period update-side accesses. The following diagram, courtesy
of Paul, shows the possible bad scenario:
CPU 0 CPU 1
----- -----
// Pre-GP update side access
WRITE_ONCE(*X, 1);
smp_mb();
r0 = rq->curr;
RCU_INIT_POINTER(rq->curr, TASK_B)
spin_unlock(rq)
rcu_read_lock_trace()
r1 = X;
/* ignore TASK_B */
Either r0==TASK_B or r1==1 is needed but neither is guaranteed.
One possible solution to solve this is to wait for an RCU grace period
at the beginning of the RCU-tasks-trace grace period before taking the
current tasks snaphot. However this would introduce large additional
latencies to RCU-tasks-trace grace periods.
Another solution is to lock the target runqueue while taking the current
task snapshot. This ensures that the update side sees the latest context
switch and subsequent context switches will see the pre-grace-period
update side accesses.
This commit therefore adds runqueue locking to cpu_curr_snapshot().
Fixes: e386b67257 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
We observed that the overhead on trigger_load_balance(), now renamed
sched_balance_trigger(), has risen with a system's core counts.
For an OLTP workload running 6.8 kernel on a 2 socket x86 systems
having 96 cores/socket, we saw that 0.7% cpu cycles are spent in
trigger_load_balance(). On older systems with fewer cores/socket, this
function's overhead was less than 0.1%.
The cause of this overhead was that there are multiple cpus calling
kick_ilb(flags), updating the balancing work needed to a common idle
load balancer cpu. The ilb_cpu's flags field got updated unconditionally
with atomic_fetch_or(). The atomic read and writes to ilb_cpu's flags
causes much cache bouncing and cpu cycles overhead. This is seen in the
annotated profile below.
kick_ilb():
if (ilb_cpu < 0)
test %r14d,%r14d
↑ js 6c
flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
mov $0x2d600,%rdi
movslq %r14d,%r8
mov %rdi,%rdx
add -0x7dd0c3e0(,%r8,8),%rdx
arch_atomic_read():
0.01 mov 0x64(%rdx),%esi
35.58 add $0x64,%rdx
arch_atomic_fetch_or():
static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
{
int val = arch_atomic_read(v);
do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
0.03 157: mov %r12d,%ecx
arch_atomic_try_cmpxchg():
return arch_try_cmpxchg(&v->counter, old, new);
0.00 mov %esi,%eax
arch_atomic_fetch_or():
do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
or %esi,%ecx
arch_atomic_try_cmpxchg():
return arch_try_cmpxchg(&v->counter, old, new);
0.01 lock cmpxchg %ecx,(%rdx)
42.96 ↓ jne 2d2
kick_ilb():
With instrumentation, we found that 81% of the updates do not result in
any change in the ilb_cpu's flags. That is, multiple cpus are asking
the ilb_cpu to do the same things over and over again, before the ilb_cpu
has a chance to run NOHZ load balance.
Skip updates to ilb_cpu's flags if no new work needs to be done.
Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
atomic read but it is less expensive than frequent unnecessary atomic
updates that generate cache bounces.
We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
(or sched_balance_trigger()) got reduced from 0.7% to 0.2%.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240531205452.65781-1-tim.c.chen@linux.intel.com
The call of rcu_idle_enter() from within cpuidle_idle_call() was
removed in commit 1098582a0f ("sched,idle,rcu: Push rcu_idle deeper
into the idle path") which makes the comment out of place.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/5b936388-47df-4050-9229-6617a6c2bba5@arm.com
There's a random number of structure pre-declaration lines in
kernel/sched/sched.h, some of which are unnecessary duplicates.
Move them to the head & order them a bit for readability.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Remove unnecessary use of the address operator.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
core.c has become rather large, move most scheduler syscall
related functionality into a separate file, syscalls.c.
This is about ~15% of core.c's raw linecount.
Move the alloc_user_cpus_ptr(), __rt_effective_prio(),
rt_effective_prio(), uclamp_none(), uclamp_se_set()
and uclamp_bucket_id() inlines to kernel/sched/sched.h.
Internally export the __sched_setscheduler(), __sched_setaffinity(),
__setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(),
check_class_changed(), splice_balance_callbacks() and balance_callbacks()
methods to better facilitate this.
Move the new file's build to sched_policy.c, because it fits there
semantically, but also because it's the smallest of the 4 build units
under an allmodconfig build:
-rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i
-rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i
-rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i
-rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i
This better balances build time for scheduler subsystem rebuilds.
I build-tested this new file as a standalone syscalls.o file for a bit,
to make sure all the encapsulations & abstractions are robust.
Also update/add my copyright notices to these files.
Build time measurements:
# -Before/+After:
kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null
Performance counter stats for 'm kernel/sched/built-in.a' (5 runs):
- 71,938,508,607 cycles ( +- 0.17% )
+ 71,992,916,493 cycles ( +- 0.22% )
- 106,214,780,964 instructions # 1.48 insn per cycle ( +- 0.01% )
+ 105,450,231,154 instructions # 1.46 insn per cycle ( +- 0.01% )
- 5,878,232,620 ns duration_time ( +- 0.38% )
+ 5,290,085,069 ns duration_time ( +- 0.21% )
- 5.8782 +- 0.0221 seconds time elapsed ( +- 0.38% )
+ 5.2901 +- 0.0111 seconds time elapsed ( +- 0.21% )
Build time improvement of -11.1% (duration_time) is expected: the
parallel build time of the scheduler subsystem is determined by the
largest, slowest to build object file, which is kernel/sched/core.o.
By moving ~15% of its complexity into another build unit, we reduced
build time by -11%.
Measured cycles spent on building is within its ~0.2% stddev noise envelope.
The -0.7% reduction in instructions spent on building the scheduler is
statistically reliable and somewhat surprising - I can only speculate:
maybe compilers aren't that efficient at building & optimizing 10+ KLOC files
(core.c), and it's an overall win to balance the linecount a bit.
Anyway, this might be a data point that suggests that reducing the linecount
of our largest files will improve not just code readability and maintainability,
but might also improve build times a bit.
Code generation got a bit worse, by 0.5kb text on an x86 defconfig build:
# -Before/+After:
kepler:~/tip> size vmlinux
text data bss dec hex filename
-26475475 10439178 1740804 38655457 24dd5e1 vmlinux
+26476003 10439178 1740804 38655985 24dd7f1 vmlinux
kepler:~/tip> size kernel/sched/built-in.a
text data bss dec hex filename
- 76056 30025 489 106570 1a04a kernel/sched/core.o (ex kernel/sched/built-in.a)
+ 63452 29453 489 93394 16cd2 kernel/sched/core.o (ex kernel/sched/built-in.a)
44299 2181 104 46584 b5f8 kernel/sched/fair.o (ex kernel/sched/built-in.a)
- 42764 3424 120 46308 b4e4 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
+ 55651 4044 120 59815 e9a7 kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
44866 12655 2192 59713 e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
44866 12655 2192 59713 e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
This is primarily due to the extra functions exported, and the size
gets exaggerated somewhat by __pfx CFI function padding:
ffffffff810cc710 <__pfx_enqueue_task>:
ffffffff810cc710: 90 nop
ffffffff810cc711: 90 nop
ffffffff810cc712: 90 nop
ffffffff810cc713: 90 nop
ffffffff810cc714: 90 nop
ffffffff810cc715: 90 nop
ffffffff810cc716: 90 nop
ffffffff810cc717: 90 nop
ffffffff810cc718: 90 nop
ffffffff810cc719: 90 nop
ffffffff810cc71a: 90 nop
ffffffff810cc71b: 90 nop
ffffffff810cc71c: 90 nop
ffffffff810cc71d: 90 nop
ffffffff810cc71e: 90 nop
ffffffff810cc71f: 90 nop
AFAICS the cost is primarily not to core.o and fair.o though (which contain
most performance sensitive scheduler functions), only to syscalls.o
that get called with much lower frequency - so I think this is an acceptable
trade-off for better code separation.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org
Hi Linus,
Please pull patches for 6.10. This includes:
- topology_span_sane() optimization from Kyle Meyer;
- fns() rework from Kuan-Wei Chiu (used in
cpumask_local_spread() and other places); and
- headers cleanup from Andy.
This also adds a MAINTAINERS record for bitops API as it's unattended,
and I'd like to follow it closer.
Thanks,
Yury
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmZKh/kACgkQsUSA/Tof
vshtSQv/eT5+KyXg5qCY3fLaIjWYD0uch5jxkdqtib5BncfIrUMsFpZBon+E2x9C
fWu7K/nfxUjKZF0Sfgl9gVns6K0rC4F24WzHjzWRVVV7+g4idXwMC1kxSX733KQC
o+D2065Dx9EmhnzypBbmNsGQsQ09WXP1GsJLf8qSGCw0lT1zNtgqsAD5sSogFGGn
ca9ZsndThuzTst5lXPXipt1W/c26frchh6SgjVTPjzALCDAf5r9Ls5np3AL1AW8X
yR8cuV9UphT1ysBplzPbBET/Fy/AGbZl1g4u72M6NvGy/nVkQ5Ic4HZj0zIem0Ic
C60PokY8lg6hQ7tWN8da12/g6WZINgZcfUfuodKiQAzryBGUJlW0aDzDUZPcCqB/
gmV/Op4RPJeQr9sibQ6nIFx73ydKVQEmZRliahzXR0p33HJCOLTATOeYqLTXQMdi
ZwhYCqG5fNEUK0VMBy8S4+tEsUAoykU21hFD04b/Ur8A49bxxJ9RDlAUC0IEc1Pj
fiU0VPFx
=H6BQ
-----END PGP SIGNATURE-----
Merge tag 'bitmap-for-6.10v2' of https://github.com/norov/linux
Pull bitmap updates from Yury Norov:
- topology_span_sane() optimization from Kyle Meyer
- fns() rework from Kuan-Wei Chiu (used in cpumask_local_spread() and
other places)
- headers cleanup from Andy
- add a MAINTAINERS record for bitops API
* tag 'bitmap-for-6.10v2' of https://github.com/norov/linux:
usercopy: Don't use "proxy" headers
bitops: Move aligned_byte_mask() to wordpart.h
MAINTAINERS: add BITOPS API record
bitmap: relax find_nth_bit() limitation on return value
lib: make test_bitops compilable into the kernel image
bitops: Optimize fns() for improved performance
lib/test_bitops: Add benchmark test for fns()
Compiler Attributes: Add __always_used macro
sched/topology: Optimize topology_span_sane()
cpumask: Add for_each_cpu_from()
Summary
* Removed sentinel elements from ctl_table structs in kernel/*
Removing sentinels in ctl_table arrays reduces the build time size and
runtime memory consumed by ~64 bytes per array. Removals for net/, io_uring/,
mm/, ipc/ and security/ are set to go into mainline through their respective
subsystems making the next release the most likely place where the final
series that removes the check for proc_name == NULL will land. This PR adds
to removals already in arch/, drivers/ and fs/.
* Adjusted ctl_table definitions and references to allow constification
Adjustments:
- Removing unused ctl_table function arguments
- Moving non-const elements from ctl_table to ctl_table_header
- Making ctl_table pointers const in ctl_table_root structure
Making the static ctl_table structs const will increase safety by keeping the
pointers to proc_handler functions in .rodata. Though no ctl_tables where
made const in this PR, the ground work for making that possible has started
with these changes sent by Thomas Weißschuh.
Testing
* These changes went into linux-next after v6.9-rc4; giving it a good month of
testing.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmZFvBMACgkQupfNUreW
QU/eGAv9EWeiXKxr3EVSMAsb9MWbJq7C99I/pd5hMf+qH4PgJpKDH7w/sb2e8h8+
unGiW83ikgrtph7OS4/xM3Y9r3Nvzd6C/OztqgMnNKeRFdMgP7wu9HaSNs05ordb
CqJdhvL93quc5HxrGTS9sdLK/wLJWOHwuWMXhX4qS44JNxTdPV2q10Rb7DZyHZ6O
C9qp61L2Q2CrnOBKIx8MoeCh20ynJQAo3b0pTN63ZYF4D0vqCcnYNNTPkge4ID8/
ULJoP5hAQY0vJ4g4fC4Gmooa5GECpm8MfZUf3SdgPyauqM/sm3dVdsLXAWD4Phcp
TsG2a/5KMYwnLHrUGwDW7bFfEemRU88h0Iam56+SKMl1kMlEpWaLL9ApQXoHFayG
e10izS+i/nlQiqYIHtuczCoTimT4/LGnonCLcdA//C3XzBT5MnOd7xsjuaQSpFWl
/CV9SZa4ABwzX7u2jty8ik90iihLCFQyKj1d9m1mDVbgb6r3iUOxVuHBgMtY7MF7
eyaEmV7l
=/rQW
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Remove sentinel elements from ctl_table structs in kernel/*
Removing sentinels in ctl_table arrays reduces the build time size
and runtime memory consumed by ~64 bytes per array. Removals for
net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline
through their respective subsystems making the next release the most
likely place where the final series that removes the check for
proc_name == NULL will land.
This adds to removals already in arch/, drivers/ and fs/.
- Adjust ctl_table definitions and references to allow constification
- Remove unused ctl_table function arguments
- Move non-const elements from ctl_table to ctl_table_header
- Make ctl_table pointers const in ctl_table_root structure
Making the static ctl_table structs const will increase safety by
keeping the pointers to proc_handler functions in .rodata. Though no
ctl_tables where made const in this PR, the ground work for making
that possible has started with these changes sent by Thomas
Weißschuh.
* tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: drop now unnecessary out-of-bounds check
sysctl: move sysctl type to ctl_table_header
sysctl: drop sysctl_is_perm_empty_ctl_table
sysctl: treewide: constify argument ctl_table_root::permissions(table)
sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
bpf: Remove the now superfluous sentinel elements from ctl_table array
delayacct: Remove the now superfluous sentinel elements from ctl_table array
kprobes: Remove the now superfluous sentinel elements from ctl_table array
printk: Remove the now superfluous sentinel elements from ctl_table array
scheduler: Remove the now superfluous sentinel elements from ctl_table array
seccomp: Remove the now superfluous sentinel elements from ctl_table array
timekeeping: Remove the now superfluous sentinel elements from ctl_table array
ftrace: Remove the now superfluous sentinel elements from ctl_table array
umh: Remove the now superfluous sentinel elements from ctl_table array
kernel misc: Remove the now superfluous sentinel elements from ctl_table array
In the cgroup v2 CPU subsystem, assuming we have a
cgroup named 'test', and we set cpu.max and cpu.max.burst:
# echo 1000000 > /sys/fs/cgroup/test/cpu.max
# echo 1000000 > /sys/fs/cgroup/test/cpu.max.burst
then we check cpu.max and cpu.max.burst:
# cat /sys/fs/cgroup/test/cpu.max
1000000 100000
# cat /sys/fs/cgroup/test/cpu.max.burst
1000000
Next we set cpu.max again and check cpu.max and
cpu.max.burst:
# echo 2000000 > /sys/fs/cgroup/test/cpu.max
# cat /sys/fs/cgroup/test/cpu.max
2000000 100000
# cat /sys/fs/cgroup/test/cpu.max.burst
1000
... we find that the cpu.max.burst value changed unexpectedly.
In cpu_max_write(), the unit of the burst value returned
by tg_get_cfs_burst() is microseconds, while in cpu_max_write(),
the burst unit used for calculation should be nanoseconds,
which leads to the bug.
To fix it, get the burst value directly from tg->cfs_bandwidth.burst.
Fixes: f4183717b3 ("sched/fair: Introduce the burstable CFS controller")
Reported-by: Qixin Liao <liaoqixin@huawei.com>
Signed-off-by: Cheng Yu <serein.chengyu@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240424132438.514720-1-serein.chengyu@huawei.com
On 05/03/2024 15:05, Vincent Guittot wrote:
I'm fine with either and that was my first thought here, too, but it did seem like
the comment was mostly placed there to justify the 'unexpected' high utilization
when explicitly passing FREQUENCY_UTIL and the need to clamp it then.
So removing did feel slightly more natural to me anyway.
So alternatively:
From: Christian Loehle <christian.loehle@arm.com>
Date: Tue, 5 Mar 2024 09:34:41 +0000
Subject: [PATCH] sched/fair: Remove stale FREQUENCY_UTIL mention
effective_cpu_util() flags were removed, so remove mentioning of the
flag.
commit 9c0b4bb7f6 ("sched/cpufreq: Rework schedutil governor performance estimation")
reworked effective_cpu_util() removing enum cpu_util_type. Modify the
comment accordingly.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/0e2833ee-0939-44e0-82a2-520a585a0153@arm.com
Change se->load.weight to se_weight(se) in the calculation for the
initial util_avg to avoid unnecessarily inflating the util_avg by 1024
times.
The reason is that se->load.weight has the unit/scale as the scaled-up
load, while cfs_rg->avg.load_avg has the unit/scale as the true task
weight (as mapped directly from the task's nice/priority value). With
CONFIG_32BIT, the scaled-up load is equal to the true task weight. With
CONFIG_64BIT, the scaled-up load is 1024 times the true task weight.
Thus, the current code may inflate the util_avg by 1024 times. The
follow-up capping will not allow the util_avg value to go wild. But the
calculation should have the correct logic.
Signed-off-by: Dawei Li <daweilics@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
Knowing domain's level exactly can be useful when setting
relax_domain_level or cpuset.sched_relax_domain_level
Usage:
cat /debug/sched/domains/cpu0/domain1/level
to dump cpu0 domain1's level.
SDM macro is not used because sd->level is 'int' and
it would hide the type mismatch between 'int' and 'u32'.
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com
Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.
This matches the behavior described in the documentation:
-1 no request. use system default or follow request of others.
0 no search.
1 search siblings (hyperthreads in a core).
"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.
Fixes: 1d3504fcf5 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
- Add cpufreq pressure feedback for the scheduler
- Rework misfit load-balancing wrt. affinity restrictions
- Clean up and simplify the code around ::overutilized and
::overload access.
- Simplify sched_balance_newidle()
- Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
handling that changed the output.
- Rework & clean up <asm/vtime.h> interactions wrt. arch_vtime_task_switch()
- Reorganize, clean up and unify most of the higher level
scheduler balancing function names around the sched_balance_*()
prefix.
- Simplify the balancing flag code (sched_balance_running)
- Miscellaneous cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmZBtA0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gQEw//WiCiV7zTlWShSiG/g8GTfoAvl53QTWXF
0jQ8TUcoIhxB5VeGgxVG1srYt8f505UXjH7L0MJLrbC3nOgRCg4NK57WiQEachKK
HORIJHT0tMMsKIwX9D5Ovo4xYJn+j7mv7j/caB+hIlzZAbWk+zZPNWcS84p0ZS/4
appY6RIcp7+cI7bisNMGUuNZS14+WMdWoX3TgoI6ekgDZ7Ky+kQvkwGEMBXsNElO
qZOj6yS/QUE4Htwz0tVfd6h5svoPM/VJMIvl0yfddPGurfNw6jEh/fjcXnLdAzZ6
9mgcosETncQbm0vfSac116lrrZIR9ygXW/yXP5S7I5dt+r+5pCrBZR2E5g7U4Ezp
GjX1+6J9U6r6y12AMLRjadFOcDvxdwtszhZq4/wAcmS3B9dvupnH/w7zqY9ho3wr
hTdtDHoAIzxJh7RNEHgeUC0/yQX3wJ9THzfYltDRIIjHTuvl4d5lHgsug+4Y9ClE
pUIQm/XKouweQN9TZz2ULle4ZhRrR9sM9QfZYfirJ/RppmuKool4riWyQFQNHLCy
mBRMjFFsTpFIOoZXU6pD4EabOpWdNrRRuND/0yg3WbDat2gBWq6jvSFv2UN1/v7i
Un5jijTuN7t8yP5lY5Tyf47kQfLlA9bUx1v56KnF9mrpI87FyiDD3MiQVhDsvpGX
rP96BIOrkSo=
=obph
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Add cpufreq pressure feedback for the scheduler
- Rework misfit load-balancing wrt affinity restrictions
- Clean up and simplify the code around ::overutilized and
::overload access.
- Simplify sched_balance_newidle()
- Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
handling that changed the output.
- Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()
- Reorganize, clean up and unify most of the higher level
scheduler balancing function names around the sched_balance_*()
prefix
- Simplify the balancing flag code (sched_balance_running)
- Miscellaneous cleanups & fixes
* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/pelt: Remove shift of thermal clock
sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
thermal/cpufreq: Remove arch_update_thermal_pressure()
sched/cpufreq: Take cpufreq feedback into account
cpufreq: Add a cpufreq pressure feedback for the scheduler
sched/fair: Fix update of rd->sg_overutilized
sched/vtime: Do not include <asm/vtime.h> header
s390/irq,nmi: Include <asm/vtime.h> header directly
s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
sched/vtime: Get rid of generic vtime_task_switch() implementation
sched/vtime: Remove confusing arch_vtime_task_switch() declaration
sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
sched/fair: Rename root_domain::overload to ::overloaded
sched/fair: Use helper functions to access root_domain::overload
sched/fair: Check root_domain::overload value before update
sched/fair: Combine EAS check with root_domain::overutilized access
sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
...
Optimize topology_span_sane() by removing duplicate comparisons.
Since topology_span_sane() is called inside of for_each_cpu(), each
previous CPU has already been compared against every other CPU. The
current CPU only needs to be compared against higher-numbered CPUs.
The total number of comparisons is reduced from N * (N - 1) to
N * (N - 1) / 2 on each non-NUMA scheduling domain level.
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
housekeeping_setup() checks cpumask_intersects(present, online) to ensure
that the kernel will have at least one housekeeping CPU after smp_init(),
but this doesn't work if the maxcpus= kernel parameter limits the number of
processors available after bootup.
For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at
boot time on a virtual machine with 4 CPUs.
Change housekeeping_setup() to use cpumask_first_and() and check that the
returned CPU number is valid and less than setup_max_cpus.
Another corner case is "nohz_full=0" on a machine with a single CPU or with
the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty
and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the
WARN_ON(tick_nohz_full_running) in tick_sched_do_timer().
And how should the kernel interpret the "nohz_full=" parameter? It should
be silently ignored, but currently cpulist_parse() happily returns the
empty cpumask and this leads to the same problem.
Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask)
and do nothing in this case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240413141746.GA10008@redhat.com
Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not
include the boot CPU, which is no longer true after:
08ae95f4fd ("nohz_full: Allow the boot CPU to be nohz_full").
However after:
aae17ebb53 ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
the kernel will crash at boot time in this case; housekeeping_any_cpu()
returns an invalid CPU number until smp_init() brings the first
housekeeping CPU up.
Change housekeeping_any_cpu() to check the result of cpumask_any_and() and
return smp_processor_id() in this case.
This is just the simple and backportable workaround which fixes the
symptom, but smp_processor_id() at boot time should be safe at least for
type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu
logic.
There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to
offline tick_do_timer_cpu (the 1st online housekeeping CPU).
Fixes: aae17ebb53 ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work")
Reported-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240411143905.GA19288@redhat.com
Closes: https://lore.kernel.org/all/20240402105847.GA24832@redhat.com/
The optional shift of the clock used by thermal/hw load avg has been
introduced to handle case where the signal was not always a high frequency
hw signal. Now that cpufreq provides a signal for firmware and
SW pressure, we can remove this exception and always keep this PELT signal
aligned with other signals.
Mark sysctl_sched_migration_cost boot parameter as deprecated
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org
Now that cpufreq provides a pressure value to the scheduler, rename
arch_update_thermal_pressure into HW pressure to reflect that it returns
a pressure applied by HW (i.e. with a high frequency change) and not
always related to thermal mitigation but also generated by max current
limitation as an example. Such high frequency signal needs filtering to be
smoothed and provide an value that reflects the average available capacity
into the scheduler time scale.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org
Aggregate the different pressures applied on the capacity of CPUs and
create a new function that returns the actual capacity of the CPU:
get_actual_cpu_capacity().
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20240326091616.3696851-3-vincent.guittot@linaro.org
sg_overloaded is used instead of sg_overutilized to update
rd->sg_overutilized.
Fixes: 4475cd8bfd ("sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240404155738.2866102-1-vincent.guittot@linaro.org
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
rm sentinel element from ctl_table arrays
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
It was possible to have pick_eevdf() return NULL, which then causes a
NULL-deref. This turned out to be due to entity_eligible() returning
falsely negative because of a s64 multiplcation overflow.
Specifically, reweight_eevdf() computes the vlag without considering
the limit placed upon vlag as update_entity_lag() does, and then the
scaling multiplication (remember that weight is 20bit fixed point) can
overflow. This then leads to the new vruntime being weird which then
causes the above entity_eligible() to go side-ways and claim nothing
is eligible.
Thus limit the range of vlag accordingly.
All this was quite rare, but fatal when it does happen.
Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/
Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Reported-by: Sergei Trofimovich <slyich@gmail.com>
Reported-by: Igor Raits <igor@gooddata.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240422082238.5784-1-xuewen.yan@unisoc.com
reweight_eevdf() only keeps V unchanged inside itself. When se !=
cfs_rq->curr, it would be dequeued from rb tree first. So that V is
changed and the result is wrong. Pass the original V to reweight_eevdf()
to fix this issue.
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
[peterz: flip if() condition for clarity]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lkml.kernel.org/r/20240306022133.81008-3-dtcccc@linux.alibaba.com
reweight_eevdf() needs the latest V to do accurate calculation for new
ve and vd. So update V unconditionally when se is runnable.
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Suggested-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Link: https://lore.kernel.org/r/20240306022133.81008-2-dtcccc@linux.alibaba.com
The generic vtime_task_switch() implementation gets built only
if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an
architecture to implement arch_vtime_task_switch() callback at
the same time, which is confusing.
Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC
architecture only and vtime_task_switch() generic variant is rather
superfluous.
Simplify the whole vtime_task_switch() wiring by moving the existing
generic implementation to PowerPC.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2cb6e3caada93623f6d4f78ad938ac6cd0e2fda8.1712760275.git.agordeev@linux.ibm.com
Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb()
which the core scheduler code has depended upon since commit:
commit 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can
unset the actively used cid when it fails to observe active task after it
sets lazy_put.
There *is* a memory barrier between storing to rq->curr and _return to
userspace_ (as required by membarrier), but the rseq mm_cid has stricter
requirements: the barrier needs to be issued between store to rq->curr
and switch_mm_cid(), which happens earlier than:
- spin_unlock(),
- switch_to().
So it's fine when the architecture switch_mm() happens to have that
barrier already, but less so when the architecture only provides the
full barrier in switch_to() or spin_unlock().
It is a bug in the rseq switch_mm_cid() implementation. All architectures
that don't have memory barriers in switch_mm(), but rather have the full
barrier either in finish_lock_switch() or switch_to() have them too late
for the needs of switch_mm_cid().
Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the
generic barrier.h header, and use it in switch_mm_cid() for scheduler
transitions where switch_mm() is expected to provide a memory barrier.
Architectures can override smp_mb__after_switch_mm() if their
switch_mm() implementation provides an implicit memory barrier.
Override it with a no-op on x86 which implicitly provide this memory
barrier by writing to CR3.
Fixes: 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
Reported-by: levi.yun <yeoreum.yun@arm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> # for arm64
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Cc: <stable@vger.kernel.org> # 6.4.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240415152114.59122-2-mathieu.desnoyers@efficios.com
SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an
unnecessary complication that only make the code harder to read and slower.
We only ever set them separately:
thule:~/tip> git grep SG_OVER kernel/sched/
kernel/sched/fair.c: set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
kernel/sched/fair.c: *sg_status |= SG_OVERUTILIZED;
kernel/sched/fair.c: *sg_status |= SG_OVERLOADED;
kernel/sched/fair.c: set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
kernel/sched/fair.c: sg_status & SG_OVERUTILIZED);
kernel/sched/fair.c: } else if (sg_status & SG_OVERUTILIZED) {
kernel/sched/fair.c: set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
kernel/sched/sched.h:#define SG_OVERLOADED 0x1 /* More than one runnable task on a CPU. */
kernel/sched/sched.h:#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
kernel/sched/sched.h: set_rd_overloaded(rq->rd, SG_OVERLOADED);
And use them separately, which results in suboptimal code:
/* update overload indicator if we are at root domain */
set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
/* Update over-utilization (tipping point, U >= 0) indicator */
set_rd_overutilized_status(env->dst_rq->rd,
Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats()
and its lower level functions, and change all of them to 'bool'.
Remove the now unused SG_OVERLOADED and SG_OVERUTILIZED flags.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVPhODZ8/nbsqbP@gmail.com
The _status() postfix has no real meaning, simplify the naming
and harmonize it with set_rd_overloaded().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
Follow the rename of the root_domain::overloaded flag.
Note that this also matches the SG_OVERUTILIZED flag better.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
It is silly to use an ambiguous noun instead of a clear adjective when naming
such a flag ...
Note how root_domain::overutilized already used a proper adjective.
rd->overloaded is now set to 1 when the root domain is overloaded.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com