Now that the (de-)offloading process can only apply to offline CPUs,
there is no more concurrency between rcu_core and nocb kthreads. Also
the mutation now happens on empty queues.
Therefore the state machine can be reduced to a single bit called
SEGCBLIST_OFFLOADED. Simplify the transition as follows:
* Upon offloading: queue the rdp to be added to the rcuog list and
wait for the rcuog kthread to set the SEGCBLIST_OFFLOADED bit. Then
unpark the rcuo kthread.
* Upon de-offloading: park the rcuo kthread. Then queue the rdp to be removed
from the rcuog list and wait for the rcuog kthread to clear the
SEGCBLIST_OFFLOADED bit.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Add a kvfree_rcu_barrier() function. It waits until all
in-flight pointers are freed over the RCU machinery. It does
not wait for any grace-period completion and is within its rights to
return immediately if there are no outstanding pointers.
This function is useful when there is a need to guarantee
that memory is fully freed before destroying memory caches,
for example, when unloading a kernel module.
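As a hedged usage sketch (the my_cache and my_obj names below are
hypothetical), a module-exit path could look like this:

struct my_obj {
        struct rcu_head rh;
        /* payload */
};

static struct kmem_cache *my_cache;

static void my_module_exit(void)
{
        /*
         * Objects previously handed to kvfree_rcu(obj, rh) may still
         * be in flight; wait for all of them to be freed before the
         * cache they came from is destroyed.
         */
        kvfree_rcu_barrier();
        kmem_cache_destroy(my_cache);
}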
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Mark emergency sections wherever multiple lines of
rcu stall information are generated. In an emergency
section, every printk() call will attempt to directly
flush to the consoles using the EMERGENCY priority.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240820063001.36405-35-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
When soft interrupt actions are called, they are passed a pointer to the
struct softirq action which contains the action's function pointer.
This pointer isn't useful, as the action callback already knows what
function it is. And since each callback handles a specific soft interrupt,
the callback also knows which soft interrupt number is running.
No soft interrupt action callback actually uses this parameter, so remove
it from the function pointer signature. This clarifies that soft interrupt
actions are global routines and makes it slightly cheaper to call them.
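For illustration, a minimal sketch of the signature change, using the
real tasklet handler as an example (body elided):

/* Before: the handler received a pointer it never used. */
static void tasklet_action(struct softirq_action *a) { /* ... */ }

/* After: soft interrupt actions are plain global routines. */
static void tasklet_action(void) { /* ... */ }

open_softirq(TASKLET_SOFTIRQ, tasklet_action);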
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING; replace "dyntick_idle" with "eqs" to drop the dyntick
reference.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, drop the dyntick reference and update the name of this helper
to express that it rechecks rdp->watching_snap after an earlier
rcu_watching_snap_save().
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the snapshot helpers are now prefixed with
"rcu_watching". Reflect that change in the storage variables for these
snapshots.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the snapshot helpers are now prefixed with
"rcu_watching". Reflect that change in the storage variables for these
snapshots.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, the dynticks prefix can go.
While at it, this helper is only meant to be called after failing an
earlier call to rcu_watching_snap_in_eqs(), document this in the comments
and add a WARN_ON_ONCE() for good measure.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
While at it, update a comment that still refers to rcu_dynticks_snap(),
which was removed by commit:
7be2e6323b9b ("rcu: Remove full memory barrier on RCU stall printout")
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Note that "watching" is the opposite of "in EQS", so the negation is lifted
out of the helper and into the callsites.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The 'struct ref_scale_ops' objects are not modified in these drivers.
Constifying these structures moves some data to a read-only section,
thus increasing overall security.
On x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
34231 4167 736 39134 98de kernel/rcu/refscale.o
After:
=====
text data bss dec hex filename
35175 3239 736 39150 98ee kernel/rcu/refscale.o
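Conceptually, the change is a one-word addition to each static
definition, as in this sketch (field initializers elided):

/* Before: writable, lives in a read-write data section. */
static struct ref_scale_ops rcu_ops = { /* ... */ };

/* After: can be placed in a read-only section. */
static const struct ref_scale_ops rcu_ops = { /* ... */ };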
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The current rcu_scale_writer() asynchronous grace-period testing uses a
per-CPU counter to track the number of outstanding callbacks. This is
subject to CPU-imbalance errors when tasks migrate from one CPU to another
between the time that the counter is incremented and the callback is
queued, and additionally in kernels configured such that callbacks can
be invoked on some CPU other than the one that queued them.
This commit therefore arranges for per-task callback counts, thus avoiding
any issues with migration of either tasks or callbacks.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Currently, if someone modprobes and rmmods rcuscale successfully, but
the next run errors out during the modprobe, non-NULL pointers to freed
memory will remain. If the run after that also errors out during the
modprobe, there will be double-free bugs.
This commit therefore NULLs out top-level pointers to memory that has
just been freed.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rcu_scale_writer() function needs only a fixed number of rcu_head
structures per kthread, which means that a trivial allocator suffices.
This commit therefore uses an llist-based allocator using a fixed array of
structures per kthread. This allows aggressive testing of RCU performance
without stressing the slab allocators.
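A hedged sketch of such an allocator (the identifiers are illustrative,
not the actual rcuscale names): each kthread owns a fixed array of
blocks threaded onto an llist that serves as its free list.

/* Uses <linux/llist.h> and <linux/rcupdate.h>. */
struct writer_mblock {
        struct llist_node wmb_node;
        struct rcu_head wmb_rh;         /* handed to call_rcu() */
};

/* Seed the per-kthread free list once from a fixed array. */
static void wmb_init(struct llist_head *free, struct writer_mblock *b, int n)
{
        int i;

        for (i = 0; i < n; i++)
                llist_add(&b[i].wmb_node, free);
}

/* Single-consumer pop: safe because only the owning kthread allocates.
 * The RCU callback returns blocks with llist_add(). */
static struct rcu_head *wmb_alloc(struct llist_head *free)
{
        struct llist_node *node = llist_del_first(free);

        return node ? &llist_entry(node, struct writer_mblock, wmb_node)->wmb_rh
                    : NULL;
}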
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Under some conditions, kmalloc(GFP_KERNEL) allocations have been
observed to repeatedly fail. This situation has been observed to
cause one of the rcu_scale_writer() instances to loop indefinitely
retrying memory allocation for an asynchronous grace-period primitive.
The problem is that if memory is short, all the other instances will
allocate all available memory before the looping task is awakened from
its rcu_barrier*() call. This in turn results in hangs, so that rcuscale
fails to complete.
This commit therefore removes the tight retry loop, so that when this
condition occurs, the affected task still passes through the full
loop with its full set of termination checks. This spreads the risk
of indefinite memory-allocation retry failures across all instances of
rcu_scale_writer() tasks, which in turn prevents the hangs.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit causes all writer tasks to provide a brief report after a
hang has been reported, spaced at one-second intervals.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Currently, if the rcuscale module's async module parameter is specified
for RCU implementations that do not have async primitives such as RCU
Tasks Rude (which now lacks a call_rcu_tasks_rude() function), there
will be a series of splats due to calls through a NULL pointer. This commit
therefore warns of this situation, but switches to non-async testing.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The commit 2d7f00b2f0 ("rcu: Suppress smp_processor_id() complaint
in synchronize_rcu_expedited_wait()") disabled preemption around
dump_cpu_task() to suppress the warning about its use in preemptible context.
Calling dump_cpu_task() does not require a non-preemptible context
except for suppressing the smp_processor_id() warning.
Because smp_processor_id() is evaluated along with in_hardirq()
to check whether the caller is in interrupt context, this patch removes
the need to disable preemption by reordering the condition so that
smp_processor_id() only gets evaluated in interrupt context.
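A sketch of the reordering in dump_cpu_task() (surrounding code elided):

/* Before: smp_processor_id() may run first, even when preemptible. */
if (cpu == smp_processor_id() && in_hardirq()) { /* dump pt_regs */ }

/* After: in_hardirq() short-circuits, so smp_processor_id() is only
 * evaluated in interrupt context, where preemption is already off. */
if (in_hardirq() && cpu == smp_processor_id()) { /* dump pt_regs */ }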
Signed-off-by: Ryo Takakura <takakura@valinux.co.jp>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
During CSD-lock stalls, the additional information output by expedited
RCU CPU stall warnings is usually redundant, flooding the console for
no good reason. However, this has been the way things work for a few
years. This commit therefore uses the rcutree.csd_lock_suppress_rcu_stall
kernel boot parameter to cause expedited RCU CPU stall warnings to
be abbreviated to a single line when there is at least one CPU that has
been stuck waiting for CSD lock for more than five seconds.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit extracts the RCU CPU stall-warning report code from
synchronize_rcu_expedited_wait() and places it in a new function named
synchronize_rcu_expedited_stall(). This is strictly a code-movement
commit. A later commit will use this reorganization to avoid printing
expedited RCU CPU stall warnings while there are ongoing CSD-lock stall
reports.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
During CSD-lock stalls, the additional information output by RCU CPU
stall warnings is usually redundant, flooding the console for no good
reason. However, this has been the way things work for a few years.
This commit therefore adds an rcutree.csd_lock_suppress_rcu_stall kernel
boot parameter that causes RCU CPU stall warnings to be abbreviated to
a single line when there is at least one CPU that has been stuck waiting
for CSD lock for more than five seconds.
To make this abbreviated message happen with decent probability:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 8 \
--configs "2*TREE01" --kconfig "CONFIG_CSD_LOCK_WAIT_DEBUG=y" \
--bootargs "csdlock_debug=1 rcutorture.stall_cpu=200 \
rcutorture.stall_cpu_holdoff=120 rcutorture.stall_cpu_irqsoff=1 \
rcutree.csd_lock_suppress_rcu_stall=1 \
rcupdate.rcu_exp_cpu_stall_timeout=5000" --trust-make
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit uses the new rcu_tasks_torture_stats_print(),
rcu_tasks_trace_torture_stats_print(), and
rcu_tasks_rude_torture_stats_print() functions in order to provide
detailed diagnostics on grace-period, callback, and barrier state when
rcu_scale_writer() hangs.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU keeps a count of the number of callbacks that the current
rcu_barrier() is waiting on, but there is currently no easy way to
work out which callback is stuck. One way to do this is to mark idle
RCU-barrier callbacks by making the ->next pointer point to the callback
itself, and this commit does just that.
Later commits will use this for debug output.
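A sketch of the convention, assuming the per-CPU rcu_data barrier head:

/* Idle: a self-linked ->next marks the barrier callback as not in
 * flight.  Entraining overwrites ->next, so later debug code can test
 * rhp->next == rhp to recognize the idle state. */
rdp->barrier_head.next = &rdp->barrier_head;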
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds a .stats function pointer to the rcu_scale_ops structure,
and if this is non-NULL, it is invoked after stack traces are dumped in
response to a rcu_scale_writer() stall.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit improves debuggability by dumping the stacks of
rcu_scale_writer() instances that have not completed in a reasonable
timeframe. These stacks are dumped remotely, but they will be accurate
in the thus-far common case where the stalled rcu_scale_writer() instances
are blocked.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This whitespace-only commit fuses a few lines of code, taking advantage
of the newish 100-character-per-line limit to save a few lines of code.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
process_durations() is not a hot path, but there is no good reason to
iterate again and again over the data already in 'buf'.
Using a seq_buf avoids some needless strcat() calls and the need for a
temporary buffer. Data is written directly at the correct place.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds the start time, in jiffies, of the most recently started
rcu_barrier_tasks*() operation to the diagnostic output used by rcuscale.
This information can be helpful in distinguishing a hung barrier operation
from a long series of barrier operations.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds rcu_tasks_torture_stats_print(),
rcu_tasks_trace_torture_stats_print(), and
rcu_tasks_rude_torture_stats_print() functions that provide detailed
diagnostics on grace-period, callback, and barrier state.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Each Tasks RCU flavor keeps a count of the number of callbacks that the
current rcu_barrier_tasks*() is waiting on, but there is currently no
easy way to work out which callback is stuck. One way to do this is to
mark idle RCU-barrier callbacks by making the ->next pointer point to
the callback itself, and this commit does just that.
Later commits will use this for debug output.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit provides a rcu_barrier_cb_is_done() function that returns
true if the *rcu_barrier*() callback passed in is done. This will be
used when printing grace-period debugging information.
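Given the self-linked convention for idle barrier callbacks described
earlier, a plausible sketch of this predicate is a one-liner:

static bool rcu_barrier_cb_is_done(struct rcu_head *rhp)
{
        return rhp->next == rhp;
}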
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rtp->tasks_gp_seq grace-period sequence number is not a strict count,
but rather the usual RCU sequence number with the lower few bits tracking
per-grace-period state and the upper bits the count of grace periods
since boot, give or take the initial value. This commit therefore
adjusts this comment.
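For reference, a sketch of the split, matching the usual rcu_seq
helpers with two low-order state bits:

#define RCU_SEQ_CTR_SHIFT       2
#define RCU_SEQ_STATE_MASK      ((1 << RCU_SEQ_CTR_SHIFT) - 1)

/* Count of grace periods since boot, give or take the initial value. */
static unsigned long rcu_seq_ctr(unsigned long s)
{
        return s >> RCU_SEQ_CTR_SHIFT;
}

/* Per-grace-period state held in the low-order bits. */
static int rcu_seq_state(unsigned long s)
{
        return s & RCU_SEQ_STATE_MASK;
}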
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The current mapping of smp_processor_id() to a CPU processing Tasks-RCU
callbacks makes some assumptions about layout. This commit therefore
adds a WARN_ON() to check these assumptions.
[ neeraj.upadhyay: Replace nr_cpu_ids with rcu_task_cpu_ids. ]
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
For kernels built with CONFIG_FORCE_NR_CPUS=y, nr_cpu_ids is
defined as NR_CPUS instead of the number of possible CPUs, which
can cause the following system panic:
smpboot: Allowing 4 CPUs, 0 hotplug CPUs
...
setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:512 nr_node_ids:1
...
BUG: unable to handle page fault for address: ffffffff9911c8c8
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 15 Comm: rcu_tasks_trace Tainted: G W
6.6.21 #1 5dc7acf91a5e8e9ac9dcfc35bee0245691283ea6
RIP: 0010:rcu_tasks_need_gpcb+0x25d/0x2c0
RSP: 0018:ffffa371c00a3e60 EFLAGS: 00010082
CR2: ffffffff9911c8c8 CR3: 000000040fa20005 CR4: 00000000001706f0
Call Trace:
<TASK>
? __die+0x23/0x80
? page_fault_oops+0xa4/0x180
? exc_page_fault+0x152/0x180
? asm_exc_page_fault+0x26/0x40
? rcu_tasks_need_gpcb+0x25d/0x2c0
? __pfx_rcu_tasks_kthread+0x40/0x40
rcu_tasks_one_gp+0x69/0x180
rcu_tasks_kthread+0x94/0xc0
kthread+0xe8/0x140
? __pfx_kthread+0x40/0x40
ret_from_fork+0x34/0x80
? __pfx_kthread+0x40/0x40
ret_from_fork_asm+0x1b/0x80
</TASK>
Considering that there may be holes in the CPU numbers, use the
maximum possible CPU number, instead of nr_cpu_ids, for configuring
enqueue and dequeue limits.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Closes: https://lore.kernel.org/linux-input/CALMA0xaTSMN+p4xUXkzrtR5r6k7hgoswcaXx7baR_z9r5jjskw@mail.gmail.com/T/#u
Reported-by: Zhixu Liu <zhixu.liu@gmail.com>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently
unused. This commit therefore removes their definitions and boot-time
self-tests.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently
unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely
once all architectures have their deep-idle and entry/exit code correctly
marked as inline or noinstr. As a step towards this goal, this commit
therefore removes these two functions from rcuscale testing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The call_rcu_tasks_rude() and rcu_barrier_tasks_rude() APIs are currently
unused. Furthermore, the idea is to get rid of RCU Tasks Rude entirely
once all architectures have their deep-idle and entry/exit code correctly
marked as inline or noinstr. As a first step towards this goal, this
commit therefore removes these two functions from rcutorture testing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit adds a stall_cpu_repeat module parameter, also available as
the rcutorture.stall_cpu_repeat boot parameter, to test repeated CPU stalls.
Note that only the first stall will pay attention to the stall_cpu_irqsoff
module parameter. For the second and subsequent stalls, interrupts will
be enabled. This is helpful when testing the interaction between RCU
CPU stall warnings and CSD-lock stall warnings.
Reported-by: Rik van Riel <riel@surriel.com>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
SRCU keeps a count of the number of callbacks that the current
srcu_barrier() is waiting on, but there is currently no easy way to
work out which callback is stuck. One way to do this is to mark idle
SRCU-barrier callbacks by making the ->next pointer point to the callback
itself, and this commit does just that.
Later commits will use this for debug output.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
SRCU maintains the ->srcu_n_exp_nodelay and ->reschedule_count values
to guide heuristics governing auto-expediting of normal SRCU grace
periods and grace-period-state-machine delays. This commit adds KCSAN
ASSERT_EXCLUSIVE_WRITER() calls to check for concurrent updates to
these fields.
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Using a higher value for the initial grace-period sequence counters
allows wrapping to occur sooner. This can help surface any issues that
might arise as a result of the wraparound.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
It was discovered that isolated CPUs could sometimes be disturbed by
kworkers processing kfree_rcu() works, causing higher than expected
latency. This is because the RCU core uses "system_wq", which doesn't have
the WQ_UNBOUND flag, to handle all its work items. Fix this violation of
latency limits by using "system_unbound_wq" in the RCU core instead.
This will ensure that those work items will not be run on CPUs marked
as isolated.
Besides the WQ_UNBOUND flag, the other major difference between system_wq
and system_unbound_wq is their max_active count. The system_unbound_wq
has a max_active of WQ_MAX_ACTIVE (512) while system_wq's max_active
is WQ_DFL_ACTIVE (256) which is half of WQ_MAX_ACTIVE.
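The fix is essentially a substitution of the target workqueue, per this
sketch (a generic work item is shown for illustration):

/* Before: a bound kworker may run the item on an isolated CPU. */
queue_work(system_wq, work);

/* After: WQ_UNBOUND workers are kept off CPUs marked as isolated. */
queue_work(system_unbound_wq, work);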
Reported-by: Vratislav Bendel <vbendel@redhat.com>
Closes: https://issues.redhat.com/browse/RHEL-50220
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Tested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Add the __counted_by compiler attribute to the flexible array member
records to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment nr_records before adding a new pointer to the records array.
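A sketch of the annotation on the bulk-data records array (other
fields elided):

struct kvfree_rcu_bulk_data {
        /* ... */
        unsigned long nr_records;
        void *records[] __counted_by(nr_records);
};

Because the bounds checkers now treat nr_records as the array's length,
the counter must be incremented before the new slot is written, hence
the reordering mentioned above.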
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any
meaning.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
[ neeraj.upadhyay: Fix htmldocs build error reported by Stephen Rothwell ]
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Within the rcu_sr_normal_gp_cleanup_work() function, there is an acquire
load from rcu_state.srs_done_tail, which is expected to be non-NULL.
This commit adds a WARN_ON_ONCE() to check this expectation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rcutorture test suite has specific tests for both of the
NUM_ACTIVE_RCU_POLL_OLDSTATE and NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE
macros provided for RCU polled grace periods. However, with the
advent of NUM_ACTIVE_SRCU_POLL_OLDSTATE, a more generic test is needed.
This commit therefore adds ->poll_active and ->poll_active_full fields
to the rcu_torture_ops structure and converts the existing specific
tests to use these fields, when present.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
This commit points the SRCU ->same_gp_state and ->get_comp_state fields
to same_state_synchronize_srcu() and get_completed_synchronize_srcu(),
allowing them to be tested.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The rcu_torture_ops structure's ->get_gp_completed and
->get_gp_completed_full fields are redundant with its ->get_comp_state
and ->get_comp_state_full fields. This commit therefore removes the
former in favor of the latter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The SEGCBLIST_RCU_CORE state can therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The locked callback acceleration handling during the transition can
therefore be removed, along with concurrent batch execution.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU core can't be running anymore while in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The locked callback acceleration handling during the transition can
therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Bypass enqueue can't happen anymore in the middle of (de-)offloading
since this sort of transition now only applies to offline CPUs.
The related safety check can therefore be removed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Currently callbacks can be (de-)offloaded only on online CPUs. This
involves an overly elaborate state machine in order to make sure that
callbacks are always handled during the process while ensuring
synchronization between rcu_core and NOCB kthreads.
The only potential user of NOCB (de-)offloading appears to be a
nohz_full toggling interface through cpusets. And the general agreement
is now to work toward toggling the nohz_full state on offline CPUs to
simplify the whole picture.
Therefore, convert the (de-)offloading to only support offline CPUs.
This involves the following changes:
* Call rcu_barrier() before deoffloading. An offline offloaded CPU may
still carry callbacks in its queue ignored by
rcutree_migrate_callbacks(). Those callbacks must all be flushed
before switching to a regular queue because no more kthreads will
handle those before the CPU ever gets re-onlined.
This means that further calls to rcu_barrier() will find an empty
queue until the CPU goes through rcutree_report_cpu_starting(). As a
result it is guaranteed that further rcu_barrier() won't try to lock
the nocb_lock for that target and thus won't risk an imbalance.
Therefore barrier_mutex doesn't need to be locked anymore upon
deoffloading.
* Assume the queue is empty before offloading, as
rcutree_migrate_callbacks() took care of everything.
This means that further calls to rcu_barrier() will find an empty
queue until the CPU goes through rcutree_report_cpu_starting(). As a
result it is guaranteed that further rcu_barrier() won't risk a
nocb_lock imbalance.
Therefore barrier_mutex doesn't need to be locked anymore upon
offloading.
* No need to flush the bypass queue anymore.
Further simplifications will follow in upcoming patches.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The barrier_mutex is currently used to protect (de-)offloading
operations and to prevent nocb_lock imbalance in rcu_barrier()
and the shrinker, and also to prevent misordered RCU barrier invocation.
Now since RCU (de-)offloading is going to happen on offline CPUs, an RCU
barrier will have to be executed while transitioning from offloaded to
de-offloaded state. And this can't happen while holding the
barrier_mutex.
Introduce a NOCB mutex to protect (de-)offloading transitions. The
barrier_mutex is still held for now when necessary to avoid barrier
callbacks reordering and nocb_lock imbalance.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When a NOCB CPU fails to create a nocb kthread on bringup, the CPU is
then deoffloaded. The barrier mutex is locked at this stage. It is
typically used to protect against concurrent (de-)offloading and/or
concurrent rcu_barrier() that would otherwise risk a nocb locking
imbalance. However:
* rcu_barrier() can't run concurrently if it's the boot CPU on early
boot-up.
* rcu_barrier() can run concurrently if it's a secondary CPU but it is
expected to see 0 callbacks on this target because it's the first
time it boots.
* (de-)offloading can't happen concurrently with smp_init(), as
rcutorture is initialized later, at least not before device_initcall(),
and userspace isn't available yet.
* (de-)offloading can't happen concurrently with cpu_up(), courtesy of
cpu_hotplug_lock.
But:
* The lazy shrinker might run concurrently with cpu_up(). It shouldn't
try to grab the nocb_lock and risk an imbalance, because lazy_len is
expected to be 0, but be extra cautious anyway.
* Also be cautious about potential subtleties on resume from hibernation.
So keep the locking and add some assertions and comments.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
nocb_is_setup is a rarely used field, accessed mostly at boot and during
CPU hotplug. It shouldn't occupy the middle of the rcu_state structure's
cacheline of hot fields.
Move it to the end and build it conditionally while at it. More cold
NOCB fields are to come.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Checking for races against concurrent (de-)offloading implies the
creation of !CONFIG_RCU_NOCB_CPU stubs to check if each relevant lock
is held. For now this only implies the nocb_lock but more are to be
expected.
Create instead a NOCB specific version of RCU_LOCKDEP_WARN() to avoid
the proliferation of stubs.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to
RCU_WATCHING, reflect that change in the related helpers.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
The symbols relating to the CT_STATE part of context_tracking.state are now
all prefixed with CT_STATE.
The RCU dynticks counter part of that atomic variable still involves
symbols with different prefixes; align them all to be prefixed with
CT_RCU_WATCHING.
Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off
rnp->qsmaskinitnext, it means that all accesses from the offline CPU
preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including
callbacks expiration and counter updates.
However, interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). The related accesses
happening between CPUHP_TEARDOWN_CPU and the rnp->qsmaskinitnext clearing
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering, especially when callbacks are invoked there through to the end,
making rcutree_migrate_callbacks() bypass the barrier_lock.
The following theoretical race example can make rcu_barrier() hang:
CPU 0 CPU 1
----- -----
//cpu_down()
smpboot_park_threads()
//ksoftirqd is parked now
<IRQ>
rcu_sched_clock_irq()
invoke_rcu_core()
do_softirq()
rcu_core()
rcu_do_batch()
// callback storm
// rcu_do_batch() returns
// before completing all
// of them
// do_softirq also returns early because of
// timeout. It defers to ksoftirqd but
// it's parked
</IRQ>
stop_machine()
take_cpu_down()
rcu_barrier()
spin_lock(barrier_lock)
// observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
<IRQ>
do_softirq()
rcu_core()
rcu_do_batch()
//completes all pending callbacks
//smp_mb() implied _after_ callback number dec
</IRQ>
rcutree_report_cpu_dead()
rnp->qsmaskinitnext &= ~rdp->grpmask;
rcutree_migrate_callbacks()
// no callback, early return without locking
// barrier_lock
//observes !rcu_rdp_cpu_online(rdp)
rcu_barrier_entrain()
rcu_segcblist_entrain()
// Observe rcu_segcblist_n_cbs(rsclp) == 0
// because no barrier between reading
// rnp->qsmaskinitnext and rsclp->len
rcu_segcblist_add_len()
smp_mb__before_atomic()
// will now observe the 0 count and empty
// list, but too late, we enqueue regardless
WRITE_ONCE(rsclp->len, rsclp->len + v);
// ignored barrier callback
// rcu barrier stall...
This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().
However, rcu_barrier() is complicated enough and probably doesn't
need too many more subtleties. CPU down is a slowpath and the
barrier_lock is seldom contended. Solve the issue by unconditionally
locking the barrier_lock in rcutree_migrate_callbacks(). This makes sure
that either rcu_barrier() sees the empty queue or its entrained
callback will be migrated.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_sync structure's ->gp_count field is always accessed under the
protection of that same structure's ->rss_lock field, with the exception
of a pair of WARN_ON_ONCE() calls just prior to acquiring that lock in
functions rcu_sync_exit() and rcu_sync_dtor(). These lockless accesses
are unnecessary and impair KCSAN's ability to catch bugs that might be
inserted via other lockless accesses.
This commit therefore moves those WARN_ON_ONCE() calls under the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If a CPU is running either a userspace application or a guest OS in
nohz_full mode, it is possible for a system call to occur just as an
RCU grace period is starting. If that CPU also has the scheduling-clock
tick enabled for any reason (such as a second runnable task), and if the
system was booted with rcutree.use_softirq=0, then RCU can add insult to
injury by awakening that CPU's rcuc kthread, resulting in yet another
task and yet more OS jitter due to switching to that task, running it,
and switching back.
In addition, in the common case where that system call is not of
excessively long duration, awakening the rcuc task is pointless.
This pointlessness is due to the fact that the CPU will enter an extended
quiescent state upon returning to the userspace application or guest OS.
In this case, the rcuc kthread cannot do anything that the main RCU
grace-period kthread cannot do on its behalf, at least if it is given
a few additional milliseconds (for example, given the time duration
specified by rcutree.jiffies_till_first_fqs, give or take scheduling
delays).
This commit therefore adds a rcutree.nohz_full_patience_delay kernel
boot parameter that specifies the grace period age (in milliseconds,
rounded to jiffies) before which RCU will refrain from awakening the
rcuc kthread. Preliminary experimentation suggests a value of 1000,
that is, one second. Increasing rcutree.nohz_full_patience_delay will
increase grace-period latency and in turn increase memory footprint,
so systems with constrained memory might choose a smaller value.
Systems with less-aggressive OS-jitter requirements might choose the
default value of zero, which keeps the traditional immediate-wakeup
behavior, thus avoiding increases in grace-period latency.
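For example, a kernel boot line selecting the suggested one-second
patience could include:

rcutree.nohz_full_patience_delay=1000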
[ paulmck: Apply Leonardo Bras feedback. ]
Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@redhat.com/
Reported-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Leonardo Bras <leobras@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
A full memory barrier is necessary at the end of the expedited grace
period to order:
1) The grace period completion (pictured by the GP sequence
number) with all preceding accesses. This pairs with rcu_seq_end()
performed by the concurrent kworker.
2) The grace period completion and subsequent post-GP update side
accesses. Pairs again against rcu_seq_end().
This full barrier is already provided by the final sync_exp_work_done()
test, making the subsequent explicit one redundant. Remove it and
improve comments.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
RCU stall printout fetches the EQS state of a CPU with a preceding full
memory barrier. However there is nothing to order this read against at
this debugging stage. It is inherently racy when performed remotely.
Do a plain read instead.
This was the last user of rcu_dynticks_snap().
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When the boot CPU initializes the per-CPU data on behalf of all possible
CPUs, a sanity check is performed on each of them to make sure none is
initialized in an extended quiescent state.
This check involves a full memory barrier which is useless at this early
boot stage.
Do a plain access instead.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:
* If the GP kthread observes the remote target in an extended quiescent
state, then that target must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it exits that extended quiescent state.
or:
* If the GP kthread observes the remote target NOT in an extended
quiescent state, then the target further entering in an extended
quiescent state must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it enters that extended quiescent state.
This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().
Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:
* If the GP kthread observes the remote target in an extended quiescent
state, then that target must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it exits that extended quiescent state.
or:
* If the GP kthread observes the remote target NOT in an extended
quiescent state, then the target further entering in an extended
quiescent state must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it enters that extended quiescent state.
This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().
Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:
* If the GP kthread observes the remote target in an extended quiescent
state, then that target must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it exits that extended quiescent state. Also the GP kthread must
observe all accesses performed by the target prior to its entering
EQS.
or:
* If the GP kthread observes the remote target NOT in an extended
quiescent state, then the target further entering in an extended
quiescent state must observe all accesses prior to the current
grace period, including the current grace period sequence number, once
it enters that extended quiescent state. Also the GP kthread later
observing that EQS must also observe all accesses performed by the
target prior to its entering EQS.
This ordering is explicitly performed both on the first EQS snapshot
and on the second one as well through the combination of a preceding
full barrier followed by an acquire read. However the second snapshot's
full memory barrier is redundant and not needed to enforce the above
guarantees:
GP kthread Remote target
---- -----
// Access prior GP
WRITE_ONCE(A, 1)
// first snapshot
smp_mb()
x = smp_load_acquire(EQS)
// Access prior GP
WRITE_ONCE(B, 1)
// EQS enter
// implied full barrier by atomic_add_return()
atomic_add_return(RCU_DYNTICKS_IDX, EQS)
// implied full barrier by atomic_add_return()
READ_ONCE(A)
// second snapshot
y = smp_load_acquire(EQS)
z = READ_ONCE(B)
If the GP kthread above fails to observe the remote target in EQS
(x not in EQS), the remote target will observe A == 1 after further
entering in EQS. Then the second snapshot taken by the GP kthread only
needs to be an acquire read in order to observe z == 1.
Therefore remove the needless full memory barrier on second snapshot.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds the get_completed_synchronize_srcu() and the
same_state_synchronize_srcu() functions. The first returns a cookie
that is always interpreted as corresponding to an expired grace period.
The second does an equality comparison of a pair of cookies.
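A usage sketch, assuming an srcu_struct named my_srcu:

unsigned long c1 = get_completed_synchronize_srcu();
unsigned long c2 = start_poll_synchronize_srcu(&my_srcu);

/* A completed cookie always polls as an expired grace period. */
WARN_ON_ONCE(!poll_state_synchronize_srcu(&my_srcu, c1));

/* Equality comparison of a pair of cookies. */
if (same_state_synchronize_srcu(c1, c2))
        pr_info("cookies refer to the same grace-period state\n");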
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have polled SRCU grace periods, a grace period can be
started by start_poll_synchronize_srcu() as well as call_srcu(),
synchronize_srcu(), and synchronize_srcu_expedited(). This commit
therefore calls out this new start_poll_synchronize_srcu() possibility
in the comment on the WARN_ON().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Interrupts are enabled in srcu_gp_end(), so this commit switches from
spin_lock_irqsave_rcu_node() and spin_unlock_irqrestore_rcu_node()
to spin_lock_irq_rcu_node() and spin_unlock_irq_rcu_node().
Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Interrupts are enabled in rcu_gp_init(), so this commit switches from
local_irq_save() and local_irq_restore() to local_irq_disable() and
local_irq_enable().
Link: https://lore.kernel.org/all/febb13ab-a4bb-48b4-8e97-7e9f7749e6da@moroto.mountain/
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In the synchronize_rcu() common case, we will have fewer than
SR_MAX_USERS_WAKE_FROM_GP users per GP. Waking up the kworker
is pointless just to free the last injected wait head since at that point,
all the users have already been awakened.
Introduce a new counter to track this and prevent the wakeup in the
common case.
[ paulmck: Remove atomic_dec_return_release in cannot-happen state. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running
on all online CPUs, no explicit ordering synchronizes properly with a
context switch. This lack of ordering can permit the new task to miss
pre-grace-period update-side accesses. The following diagram, courtesy
of Paul, shows the possible bad scenario:
CPU 0 CPU 1
----- -----
// Pre-GP update side access
WRITE_ONCE(*X, 1);
smp_mb();
r0 = rq->curr;
RCU_INIT_POINTER(rq->curr, TASK_B)
spin_unlock(rq)
rcu_read_lock_trace()
r1 = X;
/* ignore TASK_B */
Either r0==TASK_B or r1==1 is needed but neither is guaranteed.
One possible solution is to wait for an RCU grace period
at the beginning of the RCU-tasks-trace grace period before taking the
current-task snapshot. However this would introduce large additional
latencies to RCU-tasks-trace grace periods.
Another solution is to lock the target runqueue while taking the current
task snapshot. This ensures that the update side sees the latest context
switch and subsequent context switches will see the pre-grace-period
update side accesses.
This commit therefore adds runqueue locking to cpu_curr_snapshot().
Fixes: e386b67257 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Fix the following 'make W=1' warnings:
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcutorture.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcuscale.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/refscale.o
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
On powerpc systems, spinlock acquisition does not order prior stores
against later loads. This means that this statement:
rfcp->rfc_next = NULL;
Can be reordered to follow this statement:
WRITE_ONCE(*rfcpp, rfcp);
Which is then a data race with rcu_torture_fwd_prog_cr(), specifically,
this statement:
rfcpn = READ_ONCE(rfcp->rfc_next)
KCSAN located this data race, which represents a real failure on powerpc.
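A hedged sketch of one way to enforce the needed ordering, assuming
that release semantics on the publishing store suffice here:

rfcp->rfc_next = NULL;
/* Release: the NULL store above is ordered before the pointer is
 * published to the reader in rcu_torture_fwd_prog_cr(). */
smp_store_release(rfcpp, rfcp);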
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: <kasan-dev@googlegroups.com>
This commit allows rcutorture to test double-call_srcu() when the
CONFIG_DEBUG_OBJECTS_RCU_HEAD Kconfig option is enabled. The non-raw
sdp structure's ->spinlock will be acquired in call_srcu(), hence this
commit also removes the current IRQ and preemption disabling so as to
avoid lockdep complaints.
Link: https://lore.kernel.org/all/20240407112714.24460-1-qiang.zhang1211@gmail.com/
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This reverts commit 28319d6dc5. The race
it fixed was subject to conditions that don't exist anymore since:
1612160b91 ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks")
This latter commit removes the use of SRCU that used to cover the
RCU-tasks blind spot on exit between the tasklist's removal and the
final preemption disabling. The task is now placed instead into a
temporary list inside which voluntary sleeps are accounted as RCU-tasks
quiescent states. This would disarm the deadlock initially reported
against PID namespace exit.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The bypass lock contention mitigation assumes there can be at most
2 contenders on the bypass lock, following this scheme:
1) One kthread takes the bypass lock
2) Another one spins on it and increments the contended counter
3) A third one (a bypass enqueuer) sees the contended counter on and
busy loops waiting on it to decrement.
However this assumption is wrong. Only one CPU can ever find the
lock contended, because call_rcu() (the bypass enqueuer) is the only
bypass lock acquire site that may not already hold the NOCB lock
beforehand; all the other sites must first contend on the NOCB lock.
Therefore step 2) is impossible.
The other problem is that the mitigation assumes that contenders all
belong to the same rdp CPU, which is also impossible for a raw spinlock.
In theory the warning could trigger if the enqueuer holds the bypass
lock and another CPU flushes the bypass queue concurrently, but this is
prevented by all flush users:
1) NOCB kthreads only flush if they successfully _tried_ to lock the
bypass lock. So no contention management here.
2) Flushes on callback migration happen remotely when the CPU is offline.
No concurrency against bypass enqueue.
3) Flushes on deoffloading happen either locally with IRQs disabled or
remotely when the CPU is not yet online. No concurrency against
bypass enqueue.
4) Flushes on barrier entrain happen either locally with IRQs disabled or
remotely when the CPU is offline. No concurrency against
bypass enqueue.
For those reasons, the bypass lock contention mitigation isn't needed
and is even wrong. Remove it but keep the warning reporting a contended
bypass lock on a remote CPU, to retain awareness of unexpected contention.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Upon NOCB deoffloading, the rcuo kthread must be forced to sleep
until the corresponding rdp is ever offloaded again. The deoffloader
clears the SEGCBLIST_OFFLOADED flag, wakes up the rcuo kthread which
then notices that change and clears in turn its SEGCBLIST_KTHREAD_CB
flag before going to sleep, until it ever sees the SEGCBLIST_OFFLOADED
flag again, should a re-offloading happen.
Upon NOCB offloading, the rcuo kthread must be forced to wake up and
handle callbacks until the corresponding rdp is ever deoffloaded again.
The offloader sets the SEGCBLIST_OFFLOADED flag, wakes up the rcuo
kthread which then notices that change and sets in turn its
SEGCBLIST_KTHREAD_CB flag before going to check callbacks, until it
ever sees the SEGCBLIST_OFFLOADED flag cleared again, should a
de-offloading happen again.
This is all a crude ad-hoc and error-prone kthread (un-)parking
re-implementation.
Consolidate the behaviour with the appropriate API instead.
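A sketch of what the consolidated behaviour looks like with the kthread
parking API (callback handling details elided):

/* rcuo kthread main loop, sketched. */
for (;;) {
        if (kthread_should_park())
                kthread_parkme();       /* sleeps until kthread_unpark() */
        /* ... wait for and invoke this rdp's callbacks ... */
}

/* De-offload: */ kthread_park(rdp->nocb_cb_kthread);
/* Offload:    */ kthread_unpark(rdp->nocb_cb_kthread);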
[ paulmck: Apply Qiang Zhang feedback provided in Link: below. ]
Link: https://lore.kernel.org/all/20240509074046.15629-1-qiang.zhang1211@gmail.com/
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_gp_slow_register/unregister() functions are useful only in tests
where torture_type=rcu, so this commit adds ->gp_slow_register()
and ->gp_slow_unregister() function pointers in the rcu_torture_ops
structure, and slows grace periods only when these function pointers
exist.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
For rcu_torture_ops structure objects defined as static, any function
pointer member whose value is not explicitly set defaults to NULL, so
this commit removes the pre-existing explicit initializations of such
function pointers to NULL.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
This commit makes the RCU Tasks-related rcutorture tests support
printing of RCU Tasks grace-period state when a writer stall occurs or
at the end of the rcutorture test, and adds an rcu_ops->get_gp_data()
operation to simplify the acquisition of grace-period state across the
different types of rcutorture tests.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>