linux-stable/kernel/sched
Johannes Weiner f429f85e4c sched: psi: fix bogus pressure spikes from aggregation race
[ Upstream commit 3840cbe24c ]

Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-17 15:22:06 +02:00
..
autogroup.c sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() 2022-08-12 11:25:10 +02:00
autogroup.h sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h 2022-02-23 08:22:00 +01:00
build_policy.c sched: Fix missing prototype warnings 2022-05-01 10:03:43 +02:00
build_utility.c sched: Fix missing prototype warnings 2022-05-01 10:03:43 +02:00
clock.c sched/clock: Use try_cmpxchg64 in sched_clock_{local,remote} 2022-05-19 23:46:09 +02:00
completion.c sched/completion: Add wait_for_completion_state() 2022-09-07 21:53:49 +02:00
core_sched.c sched: Rename task_running() to task_on_cpu() 2022-09-07 21:53:47 +02:00
core.c sched/smt: Fix unbalance sched_smt_present dec/inc 2024-08-14 13:53:00 +02:00
cpuacct.c Merge branch 'sched/fast-headers' into sched/core 2022-03-15 09:05:05 +01:00
cpudeadline.c sched/core: Introduce sched_asym_cpucap_active() 2022-08-02 12:32:45 +02:00
cpudeadline.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
cpufreq_schedutil.c cpufreq: schedutil: Update next_freq when cpufreq_limits change 2023-10-25 12:03:11 +02:00
cpufreq.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
cpupri.c sched/rt: Fix live lock between select_fallback_rq() and RT push 2023-10-06 14:57:02 +02:00
cpupri.h sched/cpupri: Add CPUPRI_HIGHER 2020-10-29 11:00:30 +01:00
cputime.c sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-08-14 13:52:50 +02:00
deadline.c sched: Fix stop_one_cpu_nowait() vs hotplug 2023-11-20 11:51:50 +01:00
debug.c - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in 2022-10-10 17:53:04 -07:00
fair.c sched/fair: Use all little CPUs for CPU-bound workloads 2024-08-03 08:49:33 +02:00
features.h sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg 2022-06-28 09:08:30 +02:00
idle.c kernel/sched: Modify initial boot task idle setup 2023-10-06 14:57:02 +02:00
isolation.c sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU 2024-06-12 11:03:00 +02:00
loadavg.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
Makefile sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
membarrier.c sched/membarrier: reduce the ability to hammer on sys_membarrier 2024-02-23 09:12:52 +01:00
pelt.c sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
pelt.h sched/fair: Decay task PELT values during wakeup migration 2022-06-28 09:17:46 +02:00
psi.c sched: psi: fix bogus pressure spikes from aggregation race 2024-10-17 15:22:06 +02:00
rt.c sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset 2024-03-01 13:26:24 +01:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks 2024-08-03 08:49:31 +02:00
smp.h smp: Rename flush_smp_call_function_from_idle() 2022-05-01 10:03:43 +02:00
stats.c profiling: remove profile=sleep support 2024-08-14 13:52:50 +02:00
stats.h sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath 2024-07-18 13:18:43 +02:00
stop_task.c sched: Add update_current_exec_runtime helper 2022-08-27 00:05:35 +02:00
swait.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
topology.c sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level 2024-06-12 11:03:33 +02:00
wait_bit.c wait_on_bit: add an acquire memory barrier 2022-08-26 09:30:25 -07:00
wait.c wait: Return number of exclusive waiters awaken 2023-03-10 09:34:34 +01:00