2019-01-17 10:23:39 -08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0+ */
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:56:52 -07:00
|
|
|
/*
|
|
|
|
* Read-Copy Update mechanism for mutual exclusion (tree-based version)
|
|
|
|
* Internal non-public definitions that provide either classic
|
2011-03-02 13:15:15 -08:00
|
|
|
* or preemptible semantics.
|
2009-08-22 13:56:52 -07:00
|
|
|
*
|
|
|
|
* Copyright Red Hat, 2009
|
|
|
|
* Copyright IBM Corporation, 2009
|
|
|
|
*
|
|
|
|
* Author: Ingo Molnar <mingo@elte.hu>
|
2019-01-17 10:23:39 -08:00
|
|
|
* Paul E. McKenney <paulmck@linux.ibm.com>
|
2009-08-22 13:56:52 -07:00
|
|
|
*/
|
|
|
|
|
2014-06-12 13:30:25 -07:00
|
|
|
#include "../locking/rtmutex_common.h"
|
2011-08-19 11:39:11 -07:00
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
#ifdef CONFIG_RCU_NOCB_CPU
|
|
|
|
static cpumask_var_t rcu_nocb_mask; /* CPUs to have callbacks offloaded. */
|
rcu: Make rcu_nocb_poll an early_param instead of module_param
The as-documented rcu_nocb_poll will fail to enable this feature
for two reasons. (1) there is an extra "s" in the documented
name which is not in the code, and (2) since it uses module_param,
it really is expecting a prefix, akin to "rcutree.fanout_leaf"
and the prefix isn't documented.
However, there are several reasons why we might not want to
simply fix the typo and add the prefix:
1) we'd end up with rcutree.rcu_nocb_poll, and rather probably make
a change to rcutree.nocb_poll
2) if we did #1, then the prefix wouldn't be consistent with the
rcu_nocbs=<cpumap> parameter (i.e. one with, one without prefix)
3) the use of module_param in a header file is less than desired,
since it isn't immediately obvious that it will get processed
via rcutree.c and get the prefix from that (although use of
module_param_named() could clarify that.)
4) the implied export of /sys/module/rcutree/parameters/rcu_nocb_poll
data to userspace via module_param() doesn't really buy us anything,
as it is read-only and we can tell if it is enabled already without
it, since there is a printk at early boot telling us so.
In light of all that, just change it from a module_param() to an
early_setup() call, and worry about adding it to /sys later on if
we decide to allow a dynamic setting of it.
Also change the variable to be tagged as read_mostly, since it
will only ever be fiddled with at most, once at boot.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-12-20 13:19:22 -08:00
|
|
|
static bool __read_mostly rcu_nocb_poll; /* Offload kthreads are to poll. */
|
2020-11-12 01:51:21 +01:00
|
|
|
static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
return lockdep_is_held(&rdp->nocb_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
/* Race on early boot between thread creation and assignment */
|
|
|
|
if (!rdp->nocb_cb_kthread || !rdp->nocb_gp_kthread)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
if (current == rdp->nocb_cb_kthread || current == rdp->nocb_gp_kthread)
|
|
|
|
if (in_task())
|
|
|
|
return true;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool rcu_running_nocb_timer(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
return (timer_curr_running(&rdp->nocb_timer) && !in_irq());
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool rcu_running_nocb_timer(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
|
|
|
|
|
2020-11-12 01:51:21 +01:00
|
|
|
static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* In order to read the offloaded state of an rdp in a safe
|
|
|
|
* and stable way and prevent its value from being changed
|
|
|
|
* under us, we must either hold the barrier mutex, the cpu
|
|
|
|
* hotplug lock (read or write) or the nocb lock. Local
|
|
|
|
* non-preemptible reads are also safe. NOCB kthreads and
|
|
|
|
* timers have their own means of synchronization against the
|
|
|
|
* offloaded state updaters.
|
|
|
|
*/
|
|
|
|
RCU_LOCKDEP_WARN(
|
|
|
|
!(lockdep_is_held(&rcu_state.barrier_mutex) ||
|
|
|
|
(IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
|
|
|
|
rcu_lockdep_is_held_nocb(rdp) ||
|
|
|
|
(rdp == this_cpu_ptr(&rcu_data) &&
|
|
|
|
!(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
|
|
|
|
rcu_current_is_nocb_kthread(rdp) ||
|
|
|
|
rcu_running_nocb_timer(rdp)),
|
|
|
|
"Unsafe read of RCU_NOCB offloaded state"
|
|
|
|
);
|
|
|
|
|
|
|
|
return rcu_segcblist_is_offloaded(&rdp->cblist);
|
|
|
|
}
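The lockdep condition in rcu_rdp_is_offloaded() is a plain disjunction: the read is considered safe as soon as any one of the listed protections holds. The following is a minimal user-space sketch of that logic, not kernel code. The struct read_ctx type, its fields, and the function names are hypothetical stand-ins for the lockdep and context queries used above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical snapshot of the caller's context. None of these fields
 * exist in the kernel; each one models a query made in the real check.
 */
struct read_ctx {
	bool barrier_mutex_held;
	bool hotplug_enabled;		/* models IS_ENABLED(CONFIG_HOTPLUG_CPU) */
	bool cpus_lock_held;		/* models lockdep_is_cpus_held() */
	bool nocb_lock_held;
	bool rdp_is_local;		/* rdp belongs to the current CPU */
	bool preempt_count_enabled;	/* models IS_ENABLED(CONFIG_PREEMPT_COUNT) */
	bool preemptible;
	bool is_nocb_kthread;
	bool in_nocb_timer;
};

/* True if at least one protection holds, mirroring the condition above. */
static bool offloaded_read_is_safe(const struct read_ctx *c)
{
	return c->barrier_mutex_held ||
	       (c->hotplug_enabled && c->cpus_lock_held) ||
	       c->nocb_lock_held ||
	       (c->rdp_is_local &&
		!(c->preempt_count_enabled && c->preemptible)) ||
	       c->is_nocb_kthread ||
	       c->in_nocb_timer;
}

/* Walks through a few contexts to show how the result changes. */
static void offloaded_read_selftest(void)
{
	struct read_ctx c = { 0 };

	/* Remote rdp, nothing held: unsafe. */
	assert(!offloaded_read_is_safe(&c));

	/* Local rdp without preempt counting: preemptibility cannot be
	 * detected, so the check gives the benefit of the doubt. */
	c.rdp_is_local = true;
	assert(offloaded_read_is_safe(&c));

	/* Local rdp, but known to be preemptible: unsafe again. */
	c.preempt_count_enabled = true;
	c.preemptible = true;
	assert(!offloaded_read_is_safe(&c));

	/* Holding the nocb lock makes the read safe regardless. */
	c.nocb_lock_held = true;
	assert(offloaded_read_is_safe(&c));
}
```

Calling offloaded_read_selftest() from a test harness exercises the four cases. In the kernel the warning fires on the negation of this condition, and only when lockdep checking of RCU is enabled.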
|
|
|
|
|
2010-04-13 14:19:23 -07:00
|
|
|
/*
|
|
|
|
* Check the RCU kernel configuration parameters and print informative
|
2015-09-29 08:47:49 -07:00
|
|
|
* messages about anything out of the ordinary.
|
2010-04-13 14:19:23 -07:00
|
|
|
*/
|
|
|
|
static void __init rcu_bootup_announce_oddness(void)
|
|
|
|
{
|
2015-01-21 16:58:06 -08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_TRACE))
|
2017-05-15 15:30:32 -07:00
|
|
|
pr_info("\tRCU event tracing is enabled.\n");
|
2015-04-20 14:27:43 -07:00
|
|
|
if ((IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 64) ||
|
|
|
|
(!IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 32))
|
2018-05-14 13:27:33 -07:00
|
|
|
pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d.\n",
|
|
|
|
RCU_FANOUT);
|
2015-04-20 10:27:15 -07:00
|
|
|
if (rcu_fanout_exact)
|
2015-01-21 16:58:06 -08:00
|
|
|
pr_info("\tHierarchical RCU autobalancing is disabled.\n");
|
|
|
|
if (IS_ENABLED(CONFIG_RCU_FAST_NO_HZ))
|
|
|
|
pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n");
|
2017-05-12 14:37:19 -07:00
|
|
|
if (IS_ENABLED(CONFIG_PROVE_RCU))
|
2015-01-21 16:58:06 -08:00
|
|
|
pr_info("\tRCU lockdep checking is enabled.\n");
|
2020-08-05 15:51:20 -07:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
|
|
|
|
pr_info("\tRCU strict (and thus non-scalable) grace periods enabled.\n");
|
2015-06-03 08:18:31 +02:00
|
|
|
if (RCU_NUM_LVLS >= 4)
|
|
|
|
pr_info("\tFour(or more)-level hierarchy is enabled.\n");
|
2015-04-21 09:12:13 -07:00
|
|
|
if (RCU_FANOUT_LEAF != 16)
|
2015-01-21 20:58:57 -08:00
|
|
|
pr_info("\tBuild-time adjustment of leaf fanout to %d.\n",
|
2015-04-21 09:12:13 -07:00
|
|
|
RCU_FANOUT_LEAF);
|
|
|
|
if (rcu_fanout_leaf != RCU_FANOUT_LEAF)
|
2018-05-14 13:27:33 -07:00
|
|
|
pr_info("\tBoot-time adjustment of leaf fanout to %d.\n",
|
|
|
|
rcu_fanout_leaf);
|
2012-05-08 21:00:28 -07:00
|
|
|
if (nr_cpu_ids != NR_CPUS)
|
2017-09-08 16:14:18 -07:00
|
|
|
pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%u.\n", NR_CPUS, nr_cpu_ids);
|
2017-04-28 11:12:34 -07:00
|
|
|
#ifdef CONFIG_RCU_BOOST
|
2018-05-14 13:27:33 -07:00
|
|
|
pr_info("\tRCU priority boosting: priority %d delay %d ms.\n",
|
|
|
|
kthread_prio, CONFIG_RCU_BOOST_DELAY);
|
2017-04-28 11:12:34 -07:00
|
|
|
#endif
|
|
|
|
if (blimit != DEFAULT_RCU_BLIMIT)
|
|
|
|
pr_info("\tBoot-time adjustment of callback invocation limit to %ld.\n", blimit);
|
|
|
|
if (qhimark != DEFAULT_RCU_QHIMARK)
|
|
|
|
pr_info("\tBoot-time adjustment of callback high-water mark to %ld.\n", qhimark);
|
|
|
|
if (qlowmark != DEFAULT_RCU_QLOMARK)
|
|
|
|
pr_info("\tBoot-time adjustment of callback low-water mark to %ld.\n", qlowmark);
|
2019-10-30 11:56:10 -07:00
|
|
|
if (qovld != DEFAULT_RCU_QOVLD)
|
2019-12-12 17:36:43 +00:00
|
|
|
pr_info("\tBoot-time adjustment of callback overload level to %ld.\n", qovld);
|
2017-04-28 11:12:34 -07:00
|
|
|
if (jiffies_till_first_fqs != ULONG_MAX)
|
|
|
|
pr_info("\tBoot-time adjustment of first FQS scan delay to %ld jiffies.\n", jiffies_till_first_fqs);
|
|
|
|
if (jiffies_till_next_fqs != ULONG_MAX)
|
|
|
|
pr_info("\tBoot-time adjustment of subsequent FQS scan delay to %ld jiffies.\n", jiffies_till_next_fqs);
|
rcu: Compute jiffies_till_sched_qs from other kernel parameters
The jiffies_till_sched_qs value is used to determine how old a grace period
must be before RCU enlists the help of the scheduler to force a quiescent
state on the holdout CPU. Currently, this defaults to HZ/10 regardless of
system size and may be set only at boot time. This can be a problem for
very large systems, because if the values of the jiffies_till_first_fqs
and jiffies_till_next_fqs kernel parameters are left at their defaults,
they are calculated to increase as the number of CPUs actually configured
on the system increases. Thus, on a sufficiently large system, RCU would
enlist the help of the scheduler before the grace-period kthread had a
chance to scan for idle CPUs, which wastes CPU time.
This commit therefore allows jiffies_till_sched_qs to be set, if desired,
but if left as default, computes it as jiffies_till_first_fqs plus twice
jiffies_till_next_fqs, thus allowing three force-quiescent-state scans
for idle CPUs. This scales with the number of CPUs, providing sensible
default values.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-07-25 11:25:23 -07:00
|
|
|
if (jiffies_till_sched_qs != ULONG_MAX)
|
|
|
|
pr_info("\tBoot-time adjustment of scheduler-enlistment delay to %ld jiffies.\n", jiffies_till_sched_qs);
|
2017-04-28 11:12:34 -07:00
|
|
|
if (rcu_kick_kthreads)
|
|
|
|
pr_info("\tKick kthreads if too-long grace period.\n");
|
|
|
|
if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
|
|
|
|
pr_info("\tRCU callback double-/use-after-free debug enabled.\n");
|
rcu: Remove *_SLOW_* Kconfig options
The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead. The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.
This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead. However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed. TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-05-10 14:36:55 -07:00
|
|
|
if (gp_preinit_delay)
|
2017-04-28 11:12:34 -07:00
|
|
|
pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
|
2017-05-10 14:36:55 -07:00
|
|
|
if (gp_init_delay)
|
2017-04-28 11:12:34 -07:00
|
|
|
pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
|
2017-05-10 14:36:55 -07:00
|
|
|
if (gp_cleanup_delay)
|
2017-04-28 11:12:34 -07:00
|
|
|
pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
|
2019-03-20 22:13:33 +01:00
|
|
|
if (!use_softirq)
|
|
|
|
pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
|
2017-04-28 11:12:34 -07:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
|
|
|
|
pr_info("\tRCU debug extended QS entry/exit.\n");
|
2017-04-28 10:20:28 -07:00
|
|
|
rcupdate_announce_bootup_oddness();
|
2010-04-13 14:19:23 -07:00
|
|
|
}
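Most of the checks in rcu_bootup_announce_oddness() follow one pattern: stay silent unless a value differs from its default. The sketch below models three of them in user space. It is an illustration under stated assumptions, not the kernel's implementation: all function names are hypothetical, ULONG_MAX is treated as the "not set at boot" value because that is what the checks above compare against, and the derived default for jiffies_till_sched_qs follows the commit message quoted above (first scan delay plus twice the subsequent scan delay).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* The fanout default depends on word size: 64 on 64-bit, 32 otherwise. */
static bool fanout_is_nondefault(bool is_64bit, int fanout)
{
	return fanout != (is_64bit ? 64 : 32);
}

/* A jiffies_till_* parameter is announced only if it is not ULONG_MAX. */
static bool boot_param_was_set(unsigned long value)
{
	return value != ULONG_MAX;
}

/*
 * Default scheduler-enlistment delay as described in the commit message:
 * long enough for three force-quiescent-state scans.
 */
static unsigned long default_sched_qs(unsigned long first_fqs,
				      unsigned long next_fqs)
{
	return first_fqs + 2 * next_fqs;
}

/* A few representative cases; the numeric inputs are made up. */
static void announce_selftest(void)
{
	assert(!fanout_is_nondefault(true, 64));
	assert(fanout_is_nondefault(true, 32));
	assert(!fanout_is_nondefault(false, 32));
	assert(!boot_param_was_set(ULONG_MAX));
	assert(boot_param_was_set(3));
	assert(default_sched_qs(3, 3) == 9);
}
```

With both scan delays at 3 jiffies (an arbitrary example value), the derived delay is 9 jiffies, which leaves room for one first scan and two subsequent scans before the scheduler is enlisted.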
|
|
|
|
|
2014-09-22 14:00:48 -04:00
|
|
|
#ifdef CONFIG_PREEMPT_RCU
|
2009-08-22 13:56:52 -07:00
|
|
|
|
2018-07-03 17:22:34 -07:00
|
|
|
static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
|
2018-05-08 15:29:10 -07:00
|
|
|
static void rcu_read_unlock_special(struct task_struct *t);
|
2009-12-02 12:10:15 -08:00
|
|
|
|
2009-08-22 13:56:52 -07:00
|
|
|
/*
|
|
|
|
* Tell them what RCU they are running.
|
|
|
|
*/
|
2009-11-11 11:28:06 -08:00
|
|
|
static void __init rcu_bootup_announce(void)
|
2009-08-22 13:56:52 -07:00
|
|
|
{
|
2013-03-18 16:24:11 -07:00
|
|
|
pr_info("Preemptible hierarchical RCU implementation.\n");
|
2010-04-13 14:19:23 -07:00
|
|
|
rcu_bootup_announce_oddness();
|
2009-08-22 13:56:52 -07:00
|
|
|
}
|
|
|
|
|
2015-08-02 13:53:17 -07:00
|
|
|
/* Flags for rcu_preempt_ctxt_queue() decision table. */
|
|
|
|
#define RCU_GP_TASKS 0x8
|
|
|
|
#define RCU_EXP_TASKS 0x4
|
|
|
|
#define RCU_GP_BLKD 0x2
|
|
|
|
#define RCU_EXP_BLKD 0x1
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Queues a task preempted within an RCU-preempt read-side critical
|
|
|
|
* section into the appropriate location within the ->blkd_tasks list,
|
|
|
|
* depending on the states of any ongoing normal and expedited grace
|
|
|
|
* periods. The ->gp_tasks pointer indicates which element the normal
|
|
|
|
* grace period is waiting on (NULL if none), and the ->exp_tasks pointer
|
|
|
|
* indicates which element the expedited grace period is waiting on (again,
|
|
|
|
* NULL if none). If a grace period is waiting on a given element in the
|
|
|
|
* ->blkd_tasks list, it also waits on all subsequent elements. Thus,
|
|
|
|
* adding a task to the tail of the list blocks any grace period that is
|
|
|
|
* already waiting on one of the elements. In contrast, adding a task
|
|
|
|
* to the head of the list won't block any grace period that is already
|
|
|
|
* waiting on one of the elements.
|
|
|
|
*
|
|
|
|
* This queuing is imprecise, and can sometimes make an ongoing grace
|
|
|
|
* period wait for a task that is not strictly speaking blocking it.
|
|
|
|
* Given the choice, we needlessly block a normal grace period rather than
|
|
|
|
* blocking an expedited grace period.
|
|
|
|
*
|
|
|
|
* Note that an endless sequence of expedited grace periods still cannot
|
|
|
|
* indefinitely postpone a normal grace period. Eventually, all of the
|
|
|
|
* fixed number of preempted tasks blocking the normal grace period that are
|
|
|
|
* not also blocking the expedited grace period will resume and complete
|
|
|
|
* their RCU read-side critical sections. At that point, the ->gp_tasks
|
|
|
|
* pointer will equal the ->exp_tasks pointer, at which point the end of
|
|
|
|
* the corresponding expedited grace period will also be the end of the
|
|
|
|
* normal grace period.
|
|
|
|
*/
|
2015-10-07 09:10:48 -07:00
|
|
|
static void rcu_preempt_ctxt_queue(struct rcu_node *rnp, struct rcu_data *rdp)
|
|
|
|
__releases(rnp->lock) /* But leaves interrupts disabled. */
|
2015-08-02 13:53:17 -07:00
|
|
|
{
|
|
|
|
int blkd_state = (rnp->gp_tasks ? RCU_GP_TASKS : 0) +
|
|
|
|
(rnp->exp_tasks ? RCU_EXP_TASKS : 0) +
|
|
|
|
(rnp->qsmask & rdp->grpmask ? RCU_GP_BLKD : 0) +
|
|
|
|
(rnp->expmask & rdp->grpmask ? RCU_EXP_BLKD : 0);
|
|
|
|
struct task_struct *t = current;
|
|
|
|
|
2018-01-17 06:24:30 -08:00
|
|
|
raw_lockdep_assert_held_rcu_node(rnp);
|
2017-07-11 21:52:31 -07:00
|
|
|
WARN_ON_ONCE(rdp->mynode != rnp);
|
2018-04-13 17:11:44 -07:00
|
|
|
WARN_ON_ONCE(!rcu_is_leaf_node(rnp));
|
2018-05-03 10:35:33 -07:00
|
|
|
/* RCU better not be waiting on newly onlined CPUs! */
|
|
|
|
WARN_ON_ONCE(rnp->qsmaskinitnext & ~rnp->qsmaskinit & rnp->qsmask &
|
|
|
|
rdp->grpmask);
|
2017-04-28 13:19:28 -07:00
|
|
|
|
2015-08-02 13:53:17 -07:00
|
|
|
/*
|
|
|
|
* Decide where to queue the newly blocked task. In theory,
|
|
|
|
* this could be an if-statement. In practice, when I tried
|
|
|
|
* that, it was quite messy.
|
|
|
|
*/
|
|
|
|
switch (blkd_state) {
|
|
|
|
case 0:
|
|
|
|
case RCU_EXP_TASKS:
|
|
|
|
case RCU_EXP_TASKS + RCU_GP_BLKD:
|
|
|
|
case RCU_GP_TASKS:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Blocking neither GP, or first task blocking the normal
|
|
|
|
* GP but not blocking the already-waiting expedited GP.
|
|
|
|
* Queue at the head of the list to avoid unnecessarily
|
|
|
|
* blocking the already-waiting GPs.
|
|
|
|
*/
|
|
|
|
list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_BLKD:
|
|
|
|
case RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First task arriving that blocks either GP, or first task
|
|
|
|
* arriving that blocks the expedited GP (with the normal
|
|
|
|
* GP already waiting), or a task arriving that blocks
|
|
|
|
* both GPs with both GPs already waiting. Queue at the
|
|
|
|
* tail of the list to avoid any GP waiting on any of the
|
|
|
|
* already queued tasks that are not blocking it.
|
|
|
|
*/
|
|
|
|
list_add_tail(&t->rcu_node_entry, &rnp->blkd_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case RCU_EXP_TASKS + RCU_EXP_BLKD:
|
|
|
|
case RCU_EXP_TASKS + RCU_GP_BLKD + RCU_EXP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_EXP_BLKD:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Second or subsequent task blocking the expedited GP.
|
|
|
|
* The task either does not block the normal GP, or is the
|
|
|
|
* first task blocking the normal GP. Queue just after
|
|
|
|
* the first task blocking the expedited GP.
|
|
|
|
*/
|
|
|
|
list_add(&t->rcu_node_entry, rnp->exp_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case RCU_GP_TASKS + RCU_GP_BLKD:
|
|
|
|
case RCU_GP_TASKS + RCU_EXP_TASKS + RCU_GP_BLKD:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Second or subsequent task blocking the normal GP.
|
|
|
|
* The task does not block the expedited GP. Queue just
|
|
|
|
* after the first task blocking the normal GP.
|
|
|
|
*/
|
|
|
|
list_add(&t->rcu_node_entry, rnp->gp_tasks);
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
|
|
|
|
/* Yet another exercise in excessive paranoia. */
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We have now queued the task. If it was the first one to
|
|
|
|
* block either grace period, update the ->gp_tasks and/or
|
|
|
|
* ->exp_tasks pointers, respectively, to reference the newly
|
|
|
|
* blocked task.
|
|
|
|
*/
|
2017-11-27 15:13:56 -08:00
|
|
|
if (!rnp->gp_tasks && (blkd_state & RCU_GP_BLKD)) {
|
2019-10-09 14:21:54 -07:00
|
|
|
WRITE_ONCE(rnp->gp_tasks, &t->rcu_node_entry);
|
2018-04-28 18:50:06 -07:00
|
|
|
WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq);
|
2017-11-27 15:13:56 -08:00
|
|
|
}
|
2015-08-02 13:53:17 -07:00
|
|
|
if (!rnp->exp_tasks && (blkd_state & RCU_EXP_BLKD))
|
2020-01-03 14:18:12 -08:00
|
|
|
WRITE_ONCE(rnp->exp_tasks, &t->rcu_node_entry);
|
2017-07-11 21:52:31 -07:00
|
|
|
WARN_ON_ONCE(!(blkd_state & RCU_GP_BLKD) !=
|
|
|
|
!(rnp->qsmask & rdp->grpmask));
|
|
|
|
WARN_ON_ONCE(!(blkd_state & RCU_EXP_BLKD) !=
|
|
|
|
!(rnp->expmask & rdp->grpmask));
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_rcu_node(rnp); /* interrupts remain disabled. */
|
2015-08-02 13:53:17 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Report the quiescent state for the expedited GP. This expedited
|
|
|
|
* GP should not be able to end until we report, so there should be
|
|
|
|
* no need to check for a subsequent expedited GP. (Though we are
|
|
|
|
* still in a quiescent state in any case.)
|
|
|
|
*/
|
2019-03-27 15:51:25 -07:00
|
|
|
if (blkd_state & RCU_EXP_BLKD && rdp->exp_deferred_qs)
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_report_exp_rdp(rdp);
|
2018-06-28 07:39:59 -07:00
|
|
|
else
|
2019-03-27 15:51:25 -07:00
|
|
|
WARN_ON_ONCE(rdp->exp_deferred_qs);
|
2015-08-02 13:53:17 -07:00
|
|
|
}
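
The queue-placement rules in the switch statement above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the `model_*` names, the hand-rolled list helpers, and the `M_*` flag values are hypothetical, and the head-of-list group of cases is not part of this excerpt and is assumed here. Locking, tracing, and the sanity checks are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_insert(struct list_head *n, struct list_head *prev,
			struct list_head *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

/* Insert n immediately after pos (pos may be the list head). */
static void list_add(struct list_head *n, struct list_head *pos)
{
	list_insert(n, pos, pos->next);
}

/* Insert n at the tail of the list headed by head. */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
	list_insert(n, head->prev, head);
}

/* Hypothetical flag values: distinct bits, so every sum is unique. */
enum { M_GP_TASKS = 8, M_EXP_TASKS = 4, M_GP_BLKD = 2, M_EXP_BLKD = 1 };

struct model_node {
	struct list_head blkd_tasks;
	struct list_head *gp_tasks;	/* first task blocking the normal GP */
	struct list_head *exp_tasks;	/* first task blocking the expedited GP */
};

struct model_task {
	struct list_head entry;		/* kept first so model_order() can cast */
	int id;
};

static void model_node_init(struct model_node *rnp)
{
	list_init(&rnp->blkd_tasks);
	rnp->gp_tasks = NULL;
	rnp->exp_tasks = NULL;
}

/* Queue t, given whether it blocks the normal and/or expedited GP. */
static void model_queue(struct model_node *rnp, struct model_task *t,
			int blocks_gp, int blocks_exp)
{
	int state = (rnp->gp_tasks ? M_GP_TASKS : 0) +
		    (rnp->exp_tasks ? M_EXP_TASKS : 0) +
		    (blocks_gp ? M_GP_BLKD : 0) +
		    (blocks_exp ? M_EXP_BLKD : 0);

	switch (state) {
	case M_EXP_BLKD:
	case M_GP_BLKD:
	case M_GP_BLKD + M_EXP_BLKD:
	case M_GP_TASKS + M_EXP_BLKD:
	case M_GP_TASKS + M_GP_BLKD + M_EXP_BLKD:
	case M_GP_TASKS + M_EXP_TASKS + M_GP_BLKD + M_EXP_BLKD:
		list_add_tail(&t->entry, &rnp->blkd_tasks);
		break;
	case M_EXP_TASKS + M_EXP_BLKD:
	case M_EXP_TASKS + M_GP_BLKD + M_EXP_BLKD:
	case M_GP_TASKS + M_EXP_TASKS + M_EXP_BLKD:
		list_add(&t->entry, rnp->exp_tasks);
		break;
	case M_GP_TASKS + M_GP_BLKD:
	case M_GP_TASKS + M_EXP_TASKS + M_GP_BLKD:
		list_add(&t->entry, rnp->gp_tasks);
		break;
	default:
		/* Assumed: every remaining combination queues at the head. */
		list_add(&t->entry, &rnp->blkd_tasks);
		break;
	}
	if (!rnp->gp_tasks && blocks_gp)
		rnp->gp_tasks = &t->entry;
	if (!rnp->exp_tasks && blocks_exp)
		rnp->exp_tasks = &t->entry;
}

/* Copy the task ids in list order into ids[]; return how many were copied. */
static int model_order(struct model_node *rnp, int *ids, int max)
{
	int n = 0;
	struct list_head *p;

	for (p = rnp->blkd_tasks.next; p != &rnp->blkd_tasks && n < max; p = p->next)
		ids[n++] = ((struct model_task *)p)->id;
	return n;
}
```

In this model, queueing a task that blocks the normal GP, then one that blocks neither GP, then a second one that blocks the normal GP leaves the non-blocking task at the head and the two blocking tasks adjacent, starting at `gp_tasks`. A grace period then only has to wait for the entries from its pointer onward.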

/*
 * Record a preemptible-RCU quiescent state for the current CPU.
 * Note that this does not necessarily mean that the task currently running
 * on the CPU is in a quiescent state: Instead, it means that the current
 * grace period need not wait on any RCU read-side critical section that
 * starts later on this CPU. It also means that if the current task is
 * in an RCU read-side critical section, it has already added itself to
 * some leaf rcu_node structure's ->blkd_tasks list. In addition to the
 * current task, there might be any number of other tasks blocked while
 * in an RCU read-side critical section.
 *
 * Callers of this function must disable preemption.
 */
static void rcu_qs(void)
{
	RCU_LOCKDEP_WARN(preemptible(), "rcu_qs() invoked with preemption enabled!!!\n");
	if (__this_cpu_read(rcu_data.cpu_no_qs.s)) {
		trace_rcu_grace_period(TPS("rcu_preempt"),
				       __this_cpu_read(rcu_data.gp_seq),
				       TPS("cpuqs"));
		__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
		barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
		WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
	}
}

/*
 * We have entered the scheduler, and the current task might soon be
 * context-switched away from. If this task is in an RCU read-side
 * critical section, we will no longer be able to rely on the CPU to
 * record that fact, so we enqueue the task on the blkd_tasks list.
 * The task will dequeue itself when it exits the outermost enclosing
 * RCU read-side critical section. Therefore, the current grace period
 * cannot be permitted to complete until the blkd_tasks list entries
 * predating the current grace period drain, in other words, until
 * rnp->gp_tasks becomes NULL.
 *
 * Caller must disable interrupts.
 */
|
2018-07-02 14:30:37 -07:00
|
|
|
void rcu_note_context_switch(bool preempt)
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:56:52 -07:00
|
|
|
{
|
|
|
|
struct task_struct *t = current;
|
2018-07-03 15:37:16 -07:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:56:52 -07:00
|
|
|
struct rcu_node *rnp;
|
|
|
|
|
2018-07-02 14:30:37 -07:00
|
|
|
trace_rcu_utilization(TPS("Start context switch"));
|
2017-11-06 16:01:30 +01:00
|
|
|
lockdep_assert_irqs_disabled();
|
2019-11-15 14:08:53 -08:00
|
|
|
WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0);
|
|
|
|
if (rcu_preempt_depth() > 0 &&
|
2014-08-14 16:01:53 -07:00
|
|
|
!t->rcu_read_unlock_special.b.blocked) {
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:56:52 -07:00
|
|
|
|
|
|
|
/* Possibly blocking in an RCU read-side critical section. */
|
|
|
|
rnp = rdp->mynode;
|
2015-10-07 09:10:48 -07:00
|
|
|
raw_spin_lock_rcu_node(rnp);
|
2014-08-14 16:01:53 -07:00
|
|
|
t->rcu_read_unlock_special.b.blocked = true;
|
2009-08-27 15:00:12 -07:00
|
|
|
t->rcu_blocked_node = rnp;
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:56:52 -07:00
|
|
|
|
|
|
|
/*
|
2015-08-02 13:53:17 -07:00
|
|
|
* Verify the CPU's sanity, trace the preemption, and
|
|
|
|
* then queue the task as required based on the states
|
|
|
|
* of any ongoing and expedited grace periods.
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:56:52 -07:00
|
|
|
*/
|
rcu: Process offlining and onlining only at grace-period start
Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events. Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.
This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures. When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree. The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field. In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared. If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.
This of course means that RCU's notion of which CPUs are offline can be
out of date. This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started. In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-23 21:52:37 -08:00
|
|
|
WARN_ON_ONCE((rdp->grpmask & rcu_rnp_online_cpus(rnp)) == 0);
|
2009-09-18 09:50:18 -07:00
|
|
|
WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
|
2018-07-04 14:45:00 -07:00
|
|
|
trace_rcu_preempt_task(rcu_state.name,
|
rcu: Add grace-period, quiescent-state, and call_rcu trace events
Add trace events to record grace-period start and end, quiescent states,
CPUs noticing grace-period start and end, grace-period initialization,
call_rcu() invocation, tasks blocking in RCU read-side critical sections,
tasks exiting those same critical sections, force_quiescent_state()
detection of dyntick-idle and offline CPUs, CPUs entering and leaving
dyntick-idle mode (except from NMIs), CPUs coming online and going
offline, and CPUs being kicked for staying in dyntick-idle mode for too
long (as in many weeks, even on 32-bit systems).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
rcu: Add the rcu flavor to callback trace events
The earlier trace events for registering RCU callbacks and for invoking
them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched).
This commit adds the RCU flavor to those trace events.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-06-25 06:36:56 -07:00
|
|
|
t->pid,
|
|
|
|
(rnp->qsmask & rdp->grpmask)
|
2018-05-01 13:35:20 -07:00
|
|
|
? rnp->gp_seq
|
|
|
|
: rcu_seq_snap(&rnp->gp_seq));
|
2015-10-07 09:10:48 -07:00
|
|
|
rcu_preempt_ctxt_queue(rnp, rdp);
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-21 12:50:01 -07:00
|
|
|
} else {
|
|
|
|
rcu_preempt_deferred_qs(t);
|
rcu: Merge preemptable-RCU functionality into hierarchical RCU
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:56:52 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Either we were not in an RCU read-side critical section to
|
|
|
|
* begin with, or we have now recorded that critical section
|
|
|
|
* globally. Either way, we can now note a quiescent state
|
|
|
|
* for this CPU. Again, if we were in an RCU read-side critical
|
|
|
|
* section, and if that critical section was blocking the current
|
|
|
|
* grace period, then the fact that the task has been enqueued
|
|
|
|
* means that we continue to block the current grace period.
|
|
|
|
*/
|
2018-07-02 14:30:37 -07:00
|
|
|
rcu_qs();
|
2019-03-27 15:51:25 -07:00
|
|
|
if (rdp->exp_deferred_qs)
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_report_exp_rdp(rdp);
|
2020-03-16 20:38:29 -07:00
|
|
|
rcu_tasks_qs(current, preempt);
|
2018-07-02 14:30:37 -07:00
|
|
|
trace_rcu_utilization(TPS("End context switch"));
|
}
EXPORT_SYMBOL_GPL(rcu_note_context_switch);

/*
 * Check for preempted RCU readers blocking the current grace period
 * for the specified rcu_node structure. If the caller needs a reliable
 * answer, it must hold the rcu_node's ->lock.
 */
static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
{
	return READ_ONCE(rnp->gp_tasks) != NULL;
}

/* limit value for ->rcu_read_lock_nesting. */
#define RCU_NEST_PMAX (INT_MAX / 2)

static void rcu_preempt_read_enter(void)
{
	current->rcu_read_lock_nesting++;
}

static int rcu_preempt_read_exit(void)
{
	return --current->rcu_read_lock_nesting;
}

static void rcu_preempt_depth_set(int val)
{
	current->rcu_read_lock_nesting = val;
}

/*
 * Preemptible RCU implementation for rcu_read_lock().
 * Just increment ->rcu_read_lock_nesting, shared state will be updated
 * if we block.
 */
void __rcu_read_lock(void)
{
	rcu_preempt_read_enter();
	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
		WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX);
	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) && rcu_state.gp_kthread)
		WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
	barrier(); /* critical section after entry code. */
}
EXPORT_SYMBOL_GPL(__rcu_read_lock);

/*
 * Preemptible RCU implementation for rcu_read_unlock().
 * Decrement ->rcu_read_lock_nesting. If the result is zero (outermost
 * rcu_read_unlock()) and ->rcu_read_unlock_special is non-zero, then
 * invoke rcu_read_unlock_special() to clean up after a context switch
 * in an RCU read-side critical section and other special cases.
 */
void __rcu_read_unlock(void)
{
	struct task_struct *t = current;

	if (rcu_preempt_read_exit() == 0) {
		barrier(); /* critical section before exit code. */
		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
			rcu_read_unlock_special(t);
	}
	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
		int rrln = rcu_preempt_depth();

		WARN_ON_ONCE(rrln < 0 || rrln > RCU_NEST_PMAX);
	}
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
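
The nesting behavior of __rcu_read_lock() and __rcu_read_unlock() can be modeled in user space. The following is a minimal sketch, not kernel code: struct model_task, model_read_lock(), model_read_unlock(), model_unlock_special(), and the cleanups counter are hypothetical names introduced for illustration, and a plain boolean stands in for a non-zero ->rcu_read_unlock_special. It shows only that the slow path runs on the outermost unlock, and only when the special flag is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state used by preemptible RCU. */
struct model_task {
	int nesting;	/* models ->rcu_read_lock_nesting */
	bool special;	/* models a non-zero ->rcu_read_unlock_special */
	int cleanups;	/* counts slow-path invocations, illustration only */
};

/* Models __rcu_read_lock(): just bump the per-task nesting depth. */
static void model_read_lock(struct model_task *t)
{
	t->nesting++;
}

/* Models rcu_read_unlock_special(): clean up and clear the flag. */
static void model_unlock_special(struct model_task *t)
{
	t->special = false;
	t->cleanups++;
}

/* Models __rcu_read_unlock(): slow path only on the outermost unlock. */
static void model_read_unlock(struct model_task *t)
{
	assert(t->nesting > 0);	/* an unlock must pair with an earlier lock */
	if (--t->nesting == 0 && t->special)
		model_unlock_special(t);
}
```

With two nested model_read_lock() calls and the special flag set in between, the first model_read_unlock() only decrements the depth, and the second one runs the cleanup exactly once.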

/*
 * Advance a ->blkd_tasks-list pointer to the next entry, instead
 * returning NULL if at the end of the list.
 */
static struct list_head *rcu_next_node_entry(struct task_struct *t,
					     struct rcu_node *rnp)
{
	struct list_head *np;

	np = t->rcu_node_entry.next;
	if (np == &rnp->blkd_tasks)
		np = NULL;
	return np;
}
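
The end-of-list convention used by rcu_next_node_entry() follows from the list being circular: the entry after the last one is the list head itself. The following user-space sketch illustrates this under stated assumptions: struct list_node, list_init(), list_append(), and next_entry_or_null() are hypothetical names that mimic, but are not, the kernel's list_head API.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, modeled on list_head. */
struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

/* Initialize an empty list: the head points at itself. */
static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Append an entry just before the head, that is, at the tail. */
static void list_append(struct list_node *entry, struct list_node *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Advance to the next entry, returning NULL on reaching the list head. */
static struct list_node *next_entry_or_null(struct list_node *entry,
					    struct list_node *head)
{
	struct list_node *np;

	assert(entry != NULL && head != NULL);
	np = entry->next;
	return np == head ? NULL : np;
}
```

Returning NULL instead of the head lets callers treat "no next blocked task" as an ordinary null-pointer check.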

/*
 * Return true if the specified rcu_node structure has tasks that were
 * preempted within an RCU read-side critical section.
 */
static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
{
	return !list_empty(&rnp->blkd_tasks);
}

rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure passes through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs goes offline. (The events in steps
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->blocked_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_unlock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-22 08:53:48 -08:00
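
The stall described in the commit message above can be reproduced in a toy model. The following user-space sketch is an illustration under strong simplifying assumptions, not the kernel's implementation: struct model_node and every model_*() function are hypothetical, a counter replaces the blocked-task lists, and locking is ignored. It shows that moving blocked readers to the root without reporting leaves the leaf's bit set in the root's mask, while reporting at that point clears it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified model of one rcu_node structure. */
struct model_node {
	struct model_node *parent;	/* NULL for the root */
	unsigned long qsmask;		/* children still owing a quiescent state */
	unsigned long grpmask;		/* this node's bit in parent->qsmask */
	int blocked;			/* queued readers blocking the grace period */
};

/* A node is quiet once no bits remain and no queued reader blocks. */
static bool model_node_quiet(const struct model_node *np)
{
	return np->qsmask == 0 && np->blocked == 0;
}

/* Propagate quiescence upward for as long as each level is quiet. */
static void model_report_up(struct model_node *np)
{
	while (np->parent && model_node_quiet(np)) {
		np->parent->qsmask &= ~np->grpmask;
		np = np->parent;
	}
}

/*
 * Move blocked readers from a leaf to the root, as when the leaf's last
 * CPU goes offline. With report == false this models the old behavior;
 * with report == true it models the fixed behavior.
 */
static void model_offline_tasks(struct model_node *leaf,
				struct model_node *root, bool report)
{
	assert(leaf != root);
	root->blocked += leaf->blocked;
	leaf->blocked = 0;
	if (report)
		model_report_up(leaf);
}

/* In this model, the grace period can end only when the root is quiet. */
static bool model_gp_can_end(const struct model_node *root)
{
	return model_node_quiet(root);
}
```

In the unfixed case, even after the moved reader finishes and the root's blocked count drops to zero, the root's qsmask still holds the leaf's bit, so model_gp_can_end() keeps returning false.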
/*
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-21 12:50:01 -07:00
 * Report deferred quiescent states. The deferral time can
 * be quite short, for example, in the case of the call from
 * rcu_read_unlock_special().
 */
static void
rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
{
	bool empty_exp;
	bool empty_norm;
	bool empty_exp_now;
	struct list_head *np;
	bool drop_boost_mutex = false;
	struct rcu_data *rdp;
	struct rcu_node *rnp;
	union rcu_special special;

	/*
	 * If RCU core is waiting for this CPU to exit its critical section,
	 * report the fact that it has exited. Because irqs are disabled,
	 * t->rcu_read_unlock_special cannot change.
	 */
	special = t->rcu_read_unlock_special;
	rdp = this_cpu_ptr(&rcu_data);
	if (!special.s && !rdp->exp_deferred_qs) {
		local_irq_restore(flags);
		return;
	}
	t->rcu_read_unlock_special.s = 0;
rcu: Do full report for .need_qs for strict GPs
The rcu_preempt_deferred_qs_irqrestore() function is invoked at
the end of an RCU read-side critical section (for example, directly
from rcu_read_unlock()) and, if .need_qs is set, invokes rcu_qs() to
report the new quiescent state. This works, except that rcu_qs() only
updates per-CPU state, leaving reporting of the actual quiescent state
to a later call to rcu_report_qs_rdp(), for example from within a later
RCU_SOFTIRQ instance. Although this approach is exactly what you want if
you are more concerned about efficiency than about short grace periods,
in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, short grace periods are
the name of the game.
This commit therefore makes rcu_preempt_deferred_qs_irqrestore() directly
invoke rcu_report_qs_rdp() in CONFIG_RCU_STRICT_GRACE_PERIOD=y, thus
shortening grace periods.
Historical note: To the best of my knowledge, causing rcu_read_unlock()
to directly report a quiescent state first appeared in Jim Houston's
and Joe Korty's JRCU. This is the second instance of a Linux-kernel RCU
feature being inspired by JRCU, the first being RCU callback offloading
(as in the RCU_NOCB_CPU Kconfig option).
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-06 15:12:50 -07:00
	if (special.b.need_qs) {
		if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) {
			rcu_report_qs_rdp(rdp);
			udelay(rcu_unlock_delay);
		} else {
2020-08-06 15:12:50 -07:00
|
|
|
rcu_qs();
|
2020-08-07 13:44:10 -07:00
|
|
|
}
|
2020-08-06 15:12:50 -07:00
|
|
|
}
|
2009-08-22 13:56:52 -07:00
|
|
|
|
2015-08-02 13:53:17 -07:00
|
|
|
/*
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction: if interrupts
were disabled across a call to rcu_read_unlock(), the matching
rcu_read_lock() also had to be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-21 12:50:01 -07:00
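The deferral scheme described in this commit message can be sketched with a small toy model. Everything here is an assumption made for illustration: `toy_task`, `toy_ctx`, `toy_read_lock()`, `toy_read_unlock()`, `toy_need_deferred_qs()`, and `toy_deferred_qs()` are hypothetical names, and the real kernel code tracks considerably more state. The sketch shows only two points from the message: a quiescent state is deferred when interrupts or preemption are disabled at unlock time, and it is deferred further if a new read-side critical section has started in the meantime.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-task state; NOT the kernel's task_struct fields. */
struct toy_task {
	int rcu_depth;	/* Nesting depth of toy read-side critical sections. */
	bool special;	/* Outermost unlock has cleanup work to do. */
	bool deferred;	/* A quiescent state awaits a later safe point. */
	int reported;	/* Count of quiescent states actually reported. */
};

/* Toy execution context at the time of the unlock. */
struct toy_ctx {
	bool irqs_disabled;
	bool preempt_disabled;
};

/* Pending deferred quiescent state, and not inside a reader? */
static bool toy_need_deferred_qs(const struct toy_task *t)
{
	return t->deferred && t->rcu_depth == 0;
}

/* Called at a later safe point (for example, a toy "context switch"). */
static void toy_deferred_qs(struct toy_task *t)
{
	if (!toy_need_deferred_qs(t))
		return;		/* Nothing pending, or a new reader is running. */
	t->deferred = false;
	t->reported++;
}

static void toy_read_lock(struct toy_task *t)
{
	t->rcu_depth++;
}

static void toy_read_unlock(struct toy_task *t, const struct toy_ctx *ctx)
{
	assert(t->rcu_depth > 0);
	if (--t->rcu_depth > 0 || !t->special)
		return;		/* Nested unlock, or nothing special to do. */
	t->special = false;
	t->deferred = true;
	if (ctx->irqs_disabled || ctx->preempt_disabled)
		return;		/* Not safe to report now: defer. */
	toy_deferred_qs(t);
}
```

A caller would invoke `toy_deferred_qs()` from whatever later point it considers safe; in the message above those points are RCU_SOFTIRQ, context switch, idle entry, and CPU-hotplug offline.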
|
|
|
* Respond to a request by an expedited grace period for a
|
|
|
|
* quiescent state from this CPU. Note that requests from
|
|
|
|
* tasks are handled when removing the task from the
|
|
|
|
* blocked-tasks list below.
|
2015-08-02 13:53:17 -07:00
|
|
|
*/
|
2019-11-01 05:06:21 -07:00
|
|
|
if (rdp->exp_deferred_qs)
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_report_exp_rdp(rdp);
|
2015-08-02 13:53:17 -07:00
|
|
|
|
2009-08-22 13:56:52 -07:00
|
|
|
/* Clean up if blocked during RCU read-side critical section. */
|
2014-08-14 16:01:53 -07:00
|
|
|
if (special.b.blocked) {
|
2009-08-22 13:56:52 -07:00
|
|
|
|
2009-08-27 14:58:16 -07:00
|
|
|
/*
|
2015-03-08 14:20:30 -07:00
|
|
|
* Remove this task from the list it blocked on. The task
|
2015-09-29 07:55:41 -07:00
|
|
|
* now remains queued on the rcu_node corresponding to the
|
|
|
|
* CPU it first blocked on, so there is no longer any need
|
|
|
|
* to loop. Retain a WARN_ON_ONCE() out of sheer paranoia.
|
2009-08-27 14:58:16 -07:00
|
|
|
*/
|
2015-09-29 07:55:41 -07:00
|
|
|
rnp = t->rcu_blocked_node;
|
|
|
|
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
|
|
|
|
WARN_ON_ONCE(rnp != t->rcu_blocked_node);
|
2018-04-13 17:11:44 -07:00
|
|
|
WARN_ON_ONCE(!rcu_is_leaf_node(rnp));
|
2014-10-30 21:08:53 -07:00
|
|
|
empty_norm = !rcu_preempt_blocked_readers_cgp(rnp);
|
2018-04-28 18:50:06 -07:00
|
|
|
WARN_ON_ONCE(rnp->completedqs == rnp->gp_seq &&
|
2017-11-27 15:13:56 -08:00
|
|
|
(!empty_norm || rnp->qsmask));
|
2019-11-27 13:59:37 -08:00
|
|
|
empty_exp = sync_rcu_exp_done(rnp);
|
2009-12-02 12:10:15 -08:00
|
|
|
smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
|
2010-11-29 21:56:39 -08:00
|
|
|
np = rcu_next_node_entry(t, rnp);
|
2009-08-22 13:56:52 -07:00
|
|
|
list_del_init(&t->rcu_node_entry);
|
2011-08-04 07:55:34 -07:00
|
|
|
t->rcu_blocked_node = NULL;
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_unlock_preempted_task(TPS("rcu_preempt"),
|
2018-05-01 13:35:20 -07:00
|
|
|
rnp->gp_seq, t->pid);
|
2010-11-29 21:56:39 -08:00
|
|
|
if (&t->rcu_node_entry == rnp->gp_tasks)
|
2019-10-09 14:21:54 -07:00
|
|
|
WRITE_ONCE(rnp->gp_tasks, np);
|
2010-11-29 21:56:39 -08:00
|
|
|
if (&t->rcu_node_entry == rnp->exp_tasks)
|
2020-01-03 14:18:12 -08:00
|
|
|
WRITE_ONCE(rnp->exp_tasks, np);
|
2015-03-03 14:49:26 -08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_BOOST)) {
|
|
|
|
/* Snapshot ->boost_mtx ownership w/rnp->lock held. */
|
|
|
|
drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx) == t;
|
2017-07-11 21:52:31 -07:00
|
|
|
if (&t->rcu_node_entry == rnp->boost_tasks)
|
2020-01-04 10:44:41 -08:00
|
|
|
WRITE_ONCE(rnp->boost_tasks, np);
|
2015-03-03 14:49:26 -08:00
|
|
|
}
|
2009-08-22 13:56:52 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this was the last task on the current list, and if
|
|
|
|
* we aren't waiting on any CPUs, report the quiescent state.
|
2011-09-21 14:41:37 -07:00
|
|
|
* Note that rcu_report_unblock_qs_rnp() releases rnp->lock,
|
|
|
|
* so we must take a snapshot of the expedited state.
|
2009-08-22 13:56:52 -07:00
|
|
|
*/
|
2019-11-27 13:59:37 -08:00
|
|
|
empty_exp_now = sync_rcu_exp_done(rnp);
|
2014-10-30 21:08:53 -07:00
|
|
|
if (!empty_norm && !rcu_preempt_blocked_readers_cgp(rnp)) {
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_quiescent_state_report(TPS("preempt_rcu"),
|
2018-05-01 13:35:20 -07:00
|
|
|
rnp->gp_seq,
|
rcu: Add grace-period, quiescent-state, and call_rcu trace events
Add trace events to record grace-period start and end, quiescent states,
CPUs noticing grace-period start and end, grace-period initialization,
call_rcu() invocation, tasks blocking in RCU read-side critical sections,
tasks exiting those same critical sections, force_quiescent_state()
detection of dyntick-idle and offline CPUs, CPUs entering and leaving
dyntick-idle mode (except from NMIs), CPUs coming online and going
offline, and CPUs being kicked for staying in dyntick-idle mode for too
long (as in many weeks, even on 32-bit systems).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
rcu: Add the rcu flavor to callback trace events
The earlier trace events for registering RCU callbacks and for invoking
them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched).
This commit adds the RCU flavor to those trace events.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-06-25 06:36:56 -07:00
|
|
|
0, rnp->qsmask,
|
|
|
|
rnp->level,
|
|
|
|
rnp->grplo,
|
|
|
|
rnp->grphi,
|
|
|
|
!!rnp->gp_tasks);
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_report_unblock_qs_rnp(rnp, flags);
|
2012-06-28 08:08:25 -07:00
|
|
|
} else {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2012-06-28 08:08:25 -07:00
|
|
|
}
|
2009-12-02 12:10:15 -08:00
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
/* Unboost if we were boosted. */
|
2015-03-03 14:49:26 -08:00
|
|
|
if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex)
|
2017-09-19 15:36:42 -07:00
|
|
|
rt_mutex_futex_unlock(&rnp->boost_mtx);
|
2011-02-07 12:47:15 -08:00
|
|
|
|
2009-12-02 12:10:15 -08:00
|
|
|
/*
|
|
|
|
* If this was the last task on the expedited lists,
|
|
|
|
* then we need to report up the rcu_node hierarchy.
|
|
|
|
*/
|
2011-09-21 14:41:37 -07:00
|
|
|
if (!empty_exp && empty_exp_now)
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_report_exp_rnp(rnp, true);
|
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure passes through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs goes offline. (The events in steps
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->blocked_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_unlock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-22 08:53:48 -08:00
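The stall sequence listed in this commit message can be reproduced in a two-level toy model. This is a sketch under stated assumptions, not the kernel's code: `toy_rnp`, `toy_report_up()`, `toy_offline_tasks()`, `toy_task_unblock()`, and `toy_rnp_gp_done()` are hypothetical names, blocked tasks are reduced to a simple counter, and the `fixed` flag stands in for the presence or absence of the extra report that the patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy node in a two-level hierarchy; NOT the kernel's struct rcu_node. */
struct toy_rnp {
	unsigned long qsmask;	/* Children (or CPUs) still owing a quiescent state. */
	int nblocked;		/* Tasks blocked in a reader, queued on this node. */
	struct toy_rnp *parent;	/* NULL for the root. */
	unsigned long grpmask;	/* This node's bit in parent->qsmask. */
};

/* Propagate "nothing left to wait for" up the hierarchy. */
static void toy_report_up(struct toy_rnp *rnp)
{
	while (rnp->qsmask == 0 && rnp->nblocked == 0 && rnp->parent) {
		rnp->parent->qsmask &= ~rnp->grpmask;
		rnp = rnp->parent;
	}
}

/* Last CPU of the leaf goes offline: requeue its tasks onto the root. */
static void toy_offline_tasks(struct toy_rnp *leaf, struct toy_rnp *root,
			      bool fixed)
{
	root->nblocked += leaf->nblocked;
	leaf->nblocked = 0;
	if (fixed)
		toy_report_up(leaf);	/* The report that the old code omitted. */
}

/* A blocked task leaves its read-side critical section. */
static void toy_task_unblock(struct toy_rnp *rnp)
{
	assert(rnp->nblocked > 0);
	rnp->nblocked--;
	toy_report_up(rnp);
}

/* The toy grace period ends when the root waits on nothing. */
static bool toy_rnp_gp_done(const struct toy_rnp *root)
{
	return root->qsmask == 0 && root->nblocked == 0;
}
```

With `fixed` set to false, the leaf's bit in the root's `qsmask` is never cleared once the task has been requeued, so the toy grace period cannot end even after the task unblocks. With `fixed` set to true, the bit is cleared at requeue time and the grace period ends when the task unblocks.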
|
|
|
} else {
|
|
|
|
local_irq_restore(flags);
|
2009-08-22 13:56:52 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-06-21 12:50:01 -07:00
|
|
|
/*
|
|
|
|
* Is a deferred quiescent-state pending, and are we also not in
|
|
|
|
* an RCU read-side critical section? It is the caller's responsibility
|
|
|
|
* to ensure it is otherwise safe to report any deferred quiescent
|
|
|
|
* states. The reason for this is that it is safe to report a
|
|
|
|
* quiescent state during context switch even though preemption
|
|
|
|
* is disabled. This function cannot be expected to understand these
|
|
|
|
* nuances, so the caller must handle them.
|
|
|
|
*/
|
|
|
|
static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
|
|
|
|
{
|
2019-03-27 15:51:25 -07:00
|
|
|
return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
|
2018-06-21 12:50:01 -07:00
		READ_ONCE(t->rcu_read_unlock_special.s)) &&
2020-02-15 15:23:26 -08:00
	       rcu_preempt_depth() == 0;
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-21 12:50:01 -07:00
}

/*
 * Report a deferred quiescent state if needed and safe to do so.
 * As with rcu_preempt_need_deferred_qs(), "safe" involves only
 * not being in an RCU read-side critical section. The caller must
 * evaluate safety in terms of interrupt, softirq, and preemption
 * disabling.
 */
static void rcu_preempt_deferred_qs(struct task_struct *t)
{
	unsigned long flags;

	if (!rcu_preempt_need_deferred_qs(t))
		return;
	local_irq_save(flags);
	rcu_preempt_deferred_qs_irqrestore(t, flags);
}

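The check-then-report shape of rcu_preempt_deferred_qs() can be sketched in user space. This is only a rough model under stated assumptions: `struct model_task` and both `model_*` functions are hypothetical stand-ins rather than kernel APIs, and the interrupt disabling done by local_irq_save() is not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state; not the kernel's task_struct. */
struct model_task {
	int  read_lock_nesting;	/* depth of nested read-side critical sections */
	bool qs_deferred;	/* a quiescent state is waiting to be reported */
	int  qs_reported;	/* number of reports made so far */
};

/* "Needed and safe": something is pending and the task is outside any reader. */
static bool model_need_deferred_qs(const struct model_task *t)
{
	assert(t->read_lock_nesting >= 0);
	return t->qs_deferred && t->read_lock_nesting == 0;
}

/* Report the pending quiescent state, or do nothing if it must stay deferred. */
static void model_deferred_qs(struct model_task *t)
{
	if (!model_need_deferred_qs(t))
		return;
	/* The kernel saves and disables interrupts here; the model just reports. */
	t->qs_deferred = false;
	t->qs_reported++;
}
```

In this model, calling model_deferred_qs() while `read_lock_nesting` is nonzero leaves the quiescent state pending, which corresponds to the "further deferred" case mentioned in the commit message.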
2019-04-04 12:19:25 -07:00
/*
 * Minimal handler to give the scheduler a chance to re-evaluate.
 */
static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
{
	struct rcu_data *rdp;

	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
	rdp->defer_qs_iw_pending = false;
}

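The handler above recovers the enclosing per-CPU structure from a pointer to one of its members. A minimal user-space sketch of that pattern follows; it assumes a simplified `model_container_of()` macro (the kernel's container_of() adds type checking) and hypothetical `model_irq_work` and `model_rcu_data` types that only mimic the layout idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified container_of(): step back from a member to its enclosing struct. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct irq_work. */
struct model_irq_work {
	int unused;
};

/* Hypothetical stand-in for struct rcu_data. */
struct model_rcu_data {
	int cpu;
	bool defer_qs_iw_pending;
	struct model_irq_work defer_qs_iw;
};

/* Same shape as the handler above: find the owner, clear the pending flag. */
static void model_deferred_qs_handler(struct model_irq_work *iwp)
{
	struct model_rcu_data *rdp;

	assert(iwp != NULL);
	rdp = model_container_of(iwp, struct model_rcu_data, defer_qs_iw);
	rdp->defer_qs_iw_pending = false;
}
```

Embedding the work item in the per-CPU structure means the handler needs no lookup table: the member's address alone identifies its owner.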
2018-06-21 12:50:01 -07:00
/*
 * Handle special cases during rcu_read_unlock(), such as needing to
 * notify RCU core processing or task having blocked during the RCU
 * read-side critical section.
 */
static void rcu_read_unlock_special(struct task_struct *t)
{
	unsigned long flags;
	bool preempt_bh_were_disabled =
			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
	bool irqs_were_disabled;

	/* NMI handlers cannot block and cannot safely manipulate state. */
	if (in_nmi())
		return;

	local_irq_save(flags);
	irqs_were_disabled = irqs_disabled_flags(flags);
rcu: Speed up expedited GPs when interrupting RCU reader
In PREEMPT kernels, an expedited grace period might send an IPI to a
CPU that is executing an RCU read-side critical section. In that case,
it would be nice if the rcu_read_unlock() directly interacted with the
RCU core code to immediately report the quiescent state. And this does
happen in the case where the reader has been preempted. But it would
also be a nice performance optimization if immediate reporting also
happened in the preemption-free case.
This commit therefore adds an ->exp_hint field to the task_struct structure's
->rcu_read_unlock_special field. The IPI handler sets this hint when
it has interrupted an RCU read-side critical section, and this causes
the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
which, if preemption is enabled, reports the quiescent state immediately.
If preemption is disabled, then the report is required to be deferred
until preemption (or bottom halves or interrupts or whatever) is re-enabled.
Because this is a hint, it does nothing for more complicated cases. For
example, if the IPI interrupts an RCU reader, but interrupts are disabled
across the rcu_read_unlock(), but another rcu_read_lock() is executed
before interrupts are re-enabled, the hint will already have been cleared.
If you do crazy things like this, reporting will be deferred until some
later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-10-16 04:12:58 -07:00
	if (preempt_bh_were_disabled || irqs_were_disabled) {
2019-04-01 14:12:50 -07:00
		bool exp;
		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
		struct rcu_node *rnp = rdp->mynode;

2020-02-15 14:18:09 -08:00
		exp = (t->rcu_blocked_node &&
		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
		      (rdp->grpmask & READ_ONCE(rnp->expmask));
rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing. In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().
The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler. Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.
Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups. However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention. For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure. If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held. However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.
Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.
The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing. Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.
This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure. This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported. Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention. Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:
	rcu_read_lock();
	/* preempted */
	local_irq_disable();
	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
	rcu_read_lock();
	local_irq_enable();
	preempt_disable();
	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
	rcu_read_lock();
	preempt_enable();
	/* preempted, ->deferred_qs reset. */
	local_irq_disable();
	rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't. However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup. This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs flag is set.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
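The ->deferred_qs protocol described in this commit message can be modeled as a one-bit state machine. The sketch below is illustrative only: `struct model_reader` and both functions are hypothetical names, and the model assumes that the flag is cleared exactly when the deferred quiescent state is finally reported (for instance at a context switch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task's rcu_special state. */
struct model_reader {
	bool deferred_qs;
};

/*
 * Model of an rcu_read_unlock() that needs special attention.
 * Returns true if this call may attempt a full wakeup, which is
 * allowed only while ->deferred_qs is still false.
 */
static bool model_unlock_needing_attention(struct model_reader *r)
{
	bool may_wake;

	assert(r != NULL);
	may_wake = !r->deferred_qs;
	r->deferred_qs = true;
	return may_wake;
}

/* Model of the quiescent state finally being reported. */
static void model_report_qs(struct model_reader *r)
{
	r->deferred_qs = false;
}
```

Replaying the chain of linked readers from the message against this model gives the same pattern as the comments there: the first unlock may wake, the second may not, and an unlock that follows a report may wake again.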
2019-03-24 15:25:51 -07:00
		// Need to defer quiescent state until everything is enabled.
2020-02-15 14:18:09 -08:00
		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
			// Using softirq, safe to awaken, and either the
			// wakeup is free or there is an expedited GP.
2018-10-16 04:12:58 -07:00
			raise_softirq_irqoff(RCU_SOFTIRQ);
		} else {
2019-03-24 15:25:51 -07:00
			// Enabling BH or preempt does reschedule, so...
2020-02-15 14:18:09 -08:00
			// Also if no expediting, slow is OK.
			// Plus nohz_full CPUs eventually get tick enabled.
2018-10-16 04:12:58 -07:00
			set_tsk_need_resched(current);
			set_preempt_need_resched();
2019-06-22 12:05:54 -07:00
			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
rcu: Do not report strict GPs for outgoing CPUs
An outgoing CPU is marked offline in a stop-machine handler and most
of that CPU's services stop at that point, including IRQ work queues.
However, that CPU must take another pass through the scheduler and through
a number of CPU-hotplug notifiers, many of which contain RCU readers.
In the past, these readers were not a problem because the outgoing CPU
has interrupts disabled, so that rcu_read_unlock_special() would not
be invoked, and thus RCU would never attempt to queue IRQ work on the
outgoing CPU.
This changed with the advent of the CONFIG_RCU_STRICT_GRACE_PERIOD
Kconfig option, in which rcu_read_unlock_special() is invoked upon exit
from almost all RCU read-side critical sections. Worse yet, because
interrupts are disabled, rcu_read_unlock_special() cannot immediately
report a quiescent state and will therefore attempt to defer this
reporting, for example, by queueing IRQ work. Which fails with a splat
because the CPU is already marked as being offline.
But it turns out that there is no need to report this quiescent state
because rcu_report_dead() will do this job shortly after the outgoing
CPU makes its final dive into the idle loop. This commit therefore
makes rcu_read_unlock_special() refrain from queuing IRQ work onto
outgoing CPUs.
Fixes: 44bad5b3cca2 ("rcu: Do full report for .need_qs for strict GPs")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jann Horn <jannh@google.com>
2020-10-30 13:11:24 -07:00
			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
2019-04-04 12:19:25 -07:00
				// Get scheduler to re-evaluate and call hooks.
				// If !IRQ_WORK, FQS scan will eventually IPI.
				init_irq_work(&rdp->defer_qs_iw,
					      rcu_preempt_deferred_qs_handler);
				rdp->defer_qs_iw_pending = true;
				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
			}
2018-10-16 04:12:58 -07:00
		}
2018-06-21 12:50:01 -07:00
		local_irq_restore(flags);
		return;
	}
	rcu_preempt_deferred_qs_irqrestore(t, flags);
}

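The branch structure of rcu_read_unlock_special() can be summarized as a pure decision function. What follows is a simplified sketch, not the kernel's code: `struct model_ctx`, `enum model_action`, and `model_unlock_special()` are hypothetical names, every input is passed in as a plain flag, and the NMI early return is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels for the paths through the function. */
enum model_action {
	MODEL_REPORT_NOW,	/* nothing disabled: report immediately */
	MODEL_RAISE_SOFTIRQ,	/* defer and raise RCU_SOFTIRQ */
	MODEL_RESCHED,		/* defer and request a reschedule */
	MODEL_RESCHED_IRQ_WORK,	/* defer, request a reschedule, queue IRQ work */
};

/* Hypothetical snapshot of the conditions the function tests. */
struct model_ctx {
	bool preempt_bh_were_disabled;
	bool irqs_were_disabled;
	bool use_softirq;
	bool in_irq;
	bool exp;		/* an expedited grace period is waiting */
	bool irq_work_enabled;	/* models IS_ENABLED(CONFIG_IRQ_WORK) */
	bool iw_pending;	/* models rdp->defer_qs_iw_pending */
	bool cpu_online;	/* models cpu_online(rdp->cpu) */
};

static enum model_action model_unlock_special(const struct model_ctx *c)
{
	assert(c != NULL);
	if (!c->preempt_bh_were_disabled && !c->irqs_were_disabled)
		return MODEL_REPORT_NOW;
	if (c->use_softirq && (c->in_irq || (c->exp && !c->irqs_were_disabled)))
		return MODEL_RAISE_SOFTIRQ;
	/* Otherwise a reschedule is always requested; IRQ work is extra. */
	if (c->irq_work_enabled && c->irqs_were_disabled &&
	    !c->iw_pending && c->exp && c->cpu_online)
		return MODEL_RESCHED_IRQ_WORK;
	return MODEL_RESCHED;
}
```

Writing the conditions this way makes one property easy to see: in this model, IRQ work is queued only for an expedited grace period, with interrupts disabled, on an online CPU that has no such work already pending.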
2009-09-13 09:15:09 -07:00
/*
 * Check that the list of blocked tasks for the newly completed grace
 * period is in fact empty. It is a serious bug to complete a grace
 * period that still has RCU readers blocked! This function must be
2019-10-10 09:05:27 -07:00
 * invoked -before- updating this rnp's ->gp_seq.
2010-11-29 21:56:39 -08:00
 *
 * Also, if there are blocked tasks on the list, they automatically
 * block the newly created grace period, so set up ->gp_tasks accordingly.
2009-09-13 09:15:09 -07:00
 */
2018-07-03 17:22:34 -07:00
static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
2009-09-13 09:15:09 -07:00
{
2017-06-19 10:32:23 -07:00
	struct task_struct *t;

2017-04-28 13:19:28 -07:00
	RCU_LOCKDEP_WARN(preemptible(), "rcu_preempt_check_blocked_tasks() invoked with preemption enabled!!!\n");
2019-10-10 09:05:27 -07:00
	raw_lockdep_assert_held_rcu_node(rnp);
2017-11-27 15:13:56 -08:00
	if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
2018-07-03 17:22:34 -07:00
		dump_blkd_tasks(rnp, 10);
rcu: Suppress false-positive splats from mid-init task resume
Consider the following sequence of events in a PREEMPT=y kernel:
1.	All CPUs corresponding to a given leaf rcu_node structure are
	offline.

2.	The first phase of the rcu_gp_init() function's grace-period
	initialization runs, and sets that rcu_node structure's
	->qsmaskinit to zero, as it should.

3.	One of the CPUs corresponding to that rcu_node structure comes
	back online. Note that because this CPU came online after the
	grace period started, this grace period can safely ignore this
	newly onlined CPU.

4.	A task running on the newly onlined CPU enters an RCU-preempt
	read-side critical section, and is then preempted. Because
	the corresponding rcu_node structure's ->qsmask is zero,
	rcu_preempt_ctxt_queue() leaves the rcu_node structure's
	->gp_tasks field NULL, as it should.

5.	The rcu_gp_init() function continues running the second phase of
	grace-period initialization. The ->qsmask field of the parent of
	the aforementioned leaf rcu_node structure is set to not expect
	a quiescent state from the leaf, as is only right and proper.
	However, when rcu_gp_init() reaches the leaf, it invokes
	rcu_preempt_check_blocked_tasks(), which sees that the leaf's
	->blkd_tasks list is non-empty, and therefore sets the leaf's
	->gp_tasks field to reference the first task on that list.

6.	The grace period ends before the preempted task resumes, which
	is perfectly fine, given that this grace period was under no
	obligation to wait for that task to exit its late-starting
	RCU-preempt read-side critical section. Unfortunately, the
	leaf's ->gp_tasks field is non-NULL, so rcu_gp_cleanup() splats.
	After all, it appears to rcu_gp_cleanup() that the grace period
	failed to wait for a task that was supposed to be blocking that
	grace period.

This commit avoids this false-positive splat by adding a check of both
->qsmaskinit and ->wait_blkd_tasks to rcu_preempt_check_blocked_tasks().
If both ->qsmaskinit and ->wait_blkd_tasks are zero, then the task must
have entered its RCU-preempt read-side critical section late (after all,
the CPU that it is running on was not online at that time), which means
that the upper-level rcu_node structure won't be waiting for anything
on the leaf anyway.
If ->wait_blkd_tasks is non-zero, then there is at least one task on
this rcu_node structure's ->blkd_tasks list whose RCU read-side
critical section predates the current grace period. If ->qsmaskinit
is non-zero, there is at least one CPU that was online at the start
of the current grace period. Thus, if both are zero, there is nothing
to wait for.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-05-08 16:18:28 -07:00
|
|
|
if (rcu_preempt_has_tasks(rnp) &&
|
|
|
|
(rnp->qsmaskinit || rnp->wait_blkd_tasks)) {
|
2019-10-09 14:21:54 -07:00
|
|
|
WRITE_ONCE(rnp->gp_tasks, rnp->blkd_tasks.next);
|
2017-06-19 10:32:23 -07:00
|
|
|
t = container_of(rnp->gp_tasks, struct task_struct,
|
|
|
|
rcu_node_entry);
|
|
|
|
trace_rcu_unlock_preempted_task(TPS("rcu_preempt-GPS"),
|
2018-05-01 13:35:20 -07:00
|
|
|
rnp->gp_seq, t->pid);
|
2017-06-19 10:32:23 -07:00
|
|
|
}
|
2009-09-18 09:50:17 -07:00
|
|
|
WARN_ON_ONCE(rnp->qsmask);
|
2009-09-13 09:15:09 -07:00
|
|
|
}
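The ->qsmaskinit and ->wait_blkd_tasks reasoning from the "Suppress false-positive splats" commit message can be exercised in user space. What follows is a minimal sketch, not kernel code: struct sketch_node and the sketch_* functions are hypothetical stand-ins that model only the conditional in rcu_preempt_check_blocked_tasks(), with the ->blkd_tasks list reduced to a single pointer and all locking, tracing, and warnings omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct rcu_node. */
struct sketch_node {
	unsigned long qsmaskinit; /* CPUs that were online at grace-period start. */
	bool wait_blkd_tasks;     /* Readers predating the grace period are queued. */
	const int *first_blocked; /* First queued reader, or NULL if none (assumption). */
	const int *gp_tasks;      /* First reader blocking the current grace period. */
};

/*
 * Models only the conditional from rcu_preempt_check_blocked_tasks():
 * queued readers block the new grace period only if some CPU was online
 * when it started or if some queued reader predates it.
 */
static void sketch_check_blocked_tasks(struct sketch_node *np)
{
	if (np->first_blocked && (np->qsmaskinit || np->wait_blkd_tasks))
		np->gp_tasks = np->first_blocked;
}

/* Walks through the cases described in the commit message. */
static void sketch_selftest(void)
{
	int pid = 1;
	struct sketch_node late = { 0, false, &pid, NULL };
	struct sketch_node online = { 0x1, false, &pid, NULL };
	struct sketch_node old_reader = { 0, true, &pid, NULL };
	struct sketch_node empty = { 0x1, true, NULL, NULL };

	/* Late-starting reader on a formerly all-offline leaf: ignored. */
	sketch_check_blocked_tasks(&late);
	assert(late.gp_tasks == NULL);

	/* A CPU was online at grace-period start: the reader blocks the GP. */
	sketch_check_blocked_tasks(&online);
	assert(online.gp_tasks == &pid);

	/* A reader predates the grace period: it blocks the GP. */
	sketch_check_blocked_tasks(&old_reader);
	assert(old_reader.gp_tasks == &pid);

	/* Nothing queued: nothing to record. */
	sketch_check_blocked_tasks(&empty);
	assert(empty.gp_tasks == NULL);
}
```

Calling sketch_selftest() from a test harness runs all four cases. The first case is the one that used to trigger the false-positive splat in rcu_gp_cleanup().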
|
|
|
|
|
2009-08-22 13:56:52 -07:00
|
|
|
/*
|
2018-11-21 11:35:03 -08:00
|
|
|
* Check for a quiescent state from the current CPU, including voluntary
|
|
|
|
* context switches for Tasks RCU. When a task blocks, the task is
|
|
|
|
* recorded in the corresponding CPU's rcu_node structure, which is checked
|
|
|
|
* elsewhere, hence this function need only check for quiescent states
|
|
|
|
* related to the current CPU, not to those related to tasks.
|
2009-08-22 13:56:52 -07:00
|
|
|
*/
|
2018-11-21 11:35:03 -08:00
|
|
|
static void rcu_flavor_sched_clock_irq(int user)
|
2009-08-22 13:56:52 -07:00
|
|
|
{
|
|
|
|
struct task_struct *t = current;
|
|
|
|
|
2020-11-19 10:13:06 -08:00
|
|
|
lockdep_assert_irqs_disabled();
|
2018-07-02 14:30:37 -07:00
|
|
|
if (user || rcu_is_cpu_rrupt_from_idle()) {
|
|
|
|
rcu_note_voluntary_context_switch(current);
|
|
|
|
}
|
2019-11-15 14:08:53 -08:00
|
|
|
if (rcu_preempt_depth() > 0 ||
|
2018-06-21 12:50:01 -07:00
|
|
|
(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
|
|
|
|
/* No QS, force context switch if deferred. */
|
2018-07-26 13:44:00 -07:00
|
|
|
if (rcu_preempt_need_deferred_qs(t)) {
|
|
|
|
set_tsk_need_resched(t);
|
|
|
|
set_preempt_need_resched();
|
|
|
|
}
|
2018-06-21 12:50:01 -07:00
|
|
|
} else if (rcu_preempt_need_deferred_qs(t)) {
|
|
|
|
rcu_preempt_deferred_qs(t); /* Report deferred QS. */
|
|
|
|
return;
|
2020-02-15 15:23:26 -08:00
|
|
|
} else if (!WARN_ON_ONCE(rcu_preempt_depth())) {
|
2018-07-02 14:30:37 -07:00
|
|
|
rcu_qs(); /* Report immediate QS. */
|
2009-08-22 13:56:52 -07:00
|
|
|
return;
|
|
|
|
}
|
2018-06-21 12:50:01 -07:00
|
|
|
|
|
|
|
/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
|
2019-11-15 14:08:53 -08:00
|
|
|
if (rcu_preempt_depth() > 0 &&
|
2018-07-03 15:54:39 -07:00
|
|
|
__this_cpu_read(rcu_data.core_needs_qs) &&
|
|
|
|
__this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
|
2018-05-16 14:41:41 -07:00
|
|
|
!t->rcu_read_unlock_special.b.need_qs &&
|
2018-07-04 14:52:04 -07:00
|
|
|
time_after(jiffies, rcu_state.gp_start + HZ))
|
2014-08-14 16:01:53 -07:00
|
|
|
t->rcu_read_unlock_special.b.need_qs = true;
|
2009-08-22 13:56:52 -07:00
|
|
|
}
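The branch structure of rcu_flavor_sched_clock_irq() can be summarized as a small pure function. This is a hedged user-space sketch, not the kernel implementation: enum sketch_action and the sketch_* helpers are hypothetical names, the nesting depth is assumed non-negative, and the three boolean inputs stand in for rcu_preempt_depth(), the preempt_count() mask test, and rcu_preempt_need_deferred_qs(). The second helper mirrors the wrap-safe comparison performed by time_after(jiffies, rcu_state.gp_start + HZ) in the "GP is oldish" test.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical labels for what the scheduling-clock handler decides. */
enum sketch_action {
	SKETCH_NONE,            /* In a reader or preempt/bh-disabled region, nothing deferred. */
	SKETCH_FORCE_RESCHED,   /* No QS possible yet, but a deferred one is pending. */
	SKETCH_REPORT_DEFERRED, /* Report the deferred quiescent state now. */
	SKETCH_REPORT_QS        /* Report an immediate quiescent state. */
};

/* Classifies one tick; depth is assumed to be >= 0. */
static enum sketch_action sketch_clock_irq(int depth, bool preempt_or_bh_off,
					   bool need_deferred_qs)
{
	if (depth > 0 || preempt_or_bh_off)
		return need_deferred_qs ? SKETCH_FORCE_RESCHED : SKETCH_NONE;
	if (need_deferred_qs)
		return SKETCH_REPORT_DEFERRED;
	return SKETCH_REPORT_QS;
}

/* Wrap-safe "a is later than b", in the style of the kernel's time_after(). */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* True once more than hz ticks have elapsed since gp_start. */
static bool sketch_gp_is_oldish(unsigned long now, unsigned long gp_start,
				unsigned long hz)
{
	return sketch_time_after(now, gp_start + hz);
}

/* Exercises each branch plus a counter wraparound case. */
static void sketch_irq_selftest(void)
{
	assert(sketch_clock_irq(1, false, false) == SKETCH_NONE);
	assert(sketch_clock_irq(0, true, true) == SKETCH_FORCE_RESCHED);
	assert(sketch_clock_irq(0, false, true) == SKETCH_REPORT_DEFERRED);
	assert(sketch_clock_irq(0, false, false) == SKETCH_REPORT_QS);

	assert(!sketch_gp_is_oldish(1000, 0, 1000));
	assert(sketch_gp_is_oldish(1001, 0, 1000));
	/* gp_start + hz wraps past ULONG_MAX; the comparison still works. */
	assert(sketch_gp_is_oldish(90, ULONG_MAX - 10, 100));
	assert(!sketch_gp_is_oldish(ULONG_MAX - 5, ULONG_MAX - 10, 100));
}
```

Unlike the real function, the sketch returns from the first branch. In the kernel code that branch falls through to the later check, which may set ->rcu_read_unlock_special.b.need_qs when the task is still inside a reader and the grace period is more than HZ jiffies old.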
|
|
|
|
|
2013-04-11 10:15:52 -07:00
|
|
|
/*
|
|
|
|
* Check for a task exiting while in a preemptible-RCU read-side
|
rcu: Make exit_rcu() handle non-preempted RCU readers
The purpose of exit_rcu() is to handle cases where buggy code causes a
task to exit within an RCU read-side critical section. It currently
does that in the case where said RCU read-side critical section was
preempted at least once, but fails to handle cases where preemption did
not occur. This case needs to be handled because otherwise the final
context switch away from the exiting task will incorrectly behave as if
task exit were instead a preemption of an RCU read-side critical section,
and will therefore queue the exiting task. The exiting task will have
exited, and thus won't ever execute rcu_read_unlock(), which means that
it will remain queued forever, blocking all subsequent grace periods,
and eventually resulting in OOM.
Although this is arguably better than letting grace periods proceed
and having a later rcu_read_unlock() access the now-freed task
structure that once belonged to the exiting task, it would obviously
be better to correctly handle this case. This commit therefore sets
->rcu_read_lock_nesting to 1 in that case, so that the subsequent call
to __rcu_read_unlock() causes the exiting task to exit its dangling RCU
read-side critical section.
Note that deferred quiescent states need not be considered. The reason
is that removing the task from the ->blkd_tasks[] list in the call to
rcu_preempt_deferred_qs() handles the per-task component of any deferred
quiescent state, and all other components of any deferred quiescent state
are associated with the CPU, which isn't going anywhere until some later
CPU-hotplug operation, which will report any remaining deferred quiescent
states from within the rcu_report_dead() function.
Note also that negative values of ->rcu_read_lock_nesting need not be
considered. First, these won't show up in exit_rcu() unless there is
a serious bug in RCU, and second, setting ->rcu_read_lock_nesting sets
the state so that the RCU read-side critical section will be exited
normally.
Again, this code has no effect unless there has been some prior bug
that prevents a task from leaving an RCU read-side critical section
before exiting. Furthermore, there have been no reports of the bug
fixed by this commit appearing in production. This commit is therefore
absolutely -not- recommended for backporting to -stable.
Reported-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
Reported-by: BHARATH Y MOURYA <bharathm@iisc.ac.in>
Reported-by: Aravinda Prasad <aravinda@iisc.ac.in>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: ABHISHEK DUBEY <dabhishek@iisc.ac.in>
2019-02-11 07:21:29 -08:00
|
|
|
* critical section, clean up if so. No need to issue warnings, as
|
|
|
|
* debug_check_no_locks_held() already does this if lockdep is enabled.
|
|
|
|
* Besides, if this function does anything other than just immediately
|
|
|
|
* return, there was a bug of some sort. Spewing warnings from this
|
|
|
|
* function is like as not to simply obscure important prior warnings.
|
2013-04-11 10:15:52 -07:00
|
|
|
*/
|
|
|
|
void exit_rcu(void)
|
|
|
|
{
|
|
|
|
struct task_struct *t = current;
|
|
|
|
|
2019-02-11 07:21:29 -08:00
|
|
|
if (unlikely(!list_empty(&current->rcu_node_entry))) {
|
2019-11-15 14:08:53 -08:00
|
|
|
rcu_preempt_depth_set(1);
|
2019-02-11 07:21:29 -08:00
|
|
|
barrier();
|
2019-03-26 10:22:22 -07:00
|
|
|
WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
|
2019-11-15 14:08:53 -08:00
|
|
|
} else if (unlikely(rcu_preempt_depth())) {
|
|
|
|
rcu_preempt_depth_set(1);
|
2019-02-11 07:21:29 -08:00
|
|
|
} else {
|
2013-04-11 10:15:52 -07:00
|
|
|
return;
|
2019-02-11 07:21:29 -08:00
|
|
|
}
|
2013-04-11 10:15:52 -07:00
|
|
|
__rcu_read_unlock();
|
rcu: Defer reporting RCU-preempt quiescent states when disabled
This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction: if interrupts
were disabled across a call to rcu_read_unlock(), the matching
rcu_read_lock() also had to be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
2018-06-21 12:50:01 -07:00
|
|
|
rcu_preempt_deferred_qs(current);
|
2013-04-11 10:15:52 -07:00
|
|
|
}
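The exit_rcu() logic above can be modeled in user space. What follows is a minimal sketch and not the kernel implementation: struct model_task, model_rcu_read_unlock() and model_exit_rcu() are hypothetical names, and the three fields merely stand in for ->rcu_read_lock_nesting, membership on a ->blkd_tasks list, and ->rcu_read_unlock_special.b.blocked. It assumes that leaving the outermost nesting level is what dequeues a blocked task.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-task RCU reader state. */
struct model_task {
	int nesting;	/* models ->rcu_read_lock_nesting */
	bool queued;	/* models membership on a ->blkd_tasks list */
	bool blocked;	/* models ->rcu_read_unlock_special.b.blocked */
};

/* Models __rcu_read_unlock(): leaving the outermost level dequeues the task. */
static void model_rcu_read_unlock(struct model_task *t)
{
	if (--t->nesting == 0 && t->blocked) {
		t->blocked = false;
		t->queued = false;
	}
}

/* Models exit_rcu(): force a dangling reader out of its critical section. */
static void model_exit_rcu(struct model_task *t)
{
	if (t->queued) {
		/* The reader was preempted at least once. */
		t->nesting = 1;
		t->blocked = true;
	} else if (t->nesting != 0) {
		/* The reader was never preempted: the case the commit adds. */
		t->nesting = 1;
	} else {
		return;	/* Not in a critical section, nothing to do. */
	}
	model_rcu_read_unlock(t);
}
```

Forcing the nesting count to 1 before the single unlock is what lets one call unwind any depth of dangling nesting.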
|
|
|
|
|
2017-11-27 15:13:56 -08:00
|
|
|
/*
|
|
|
|
* Dump the blocked-tasks state, but limit the list dump to the
|
|
|
|
* specified number of elements.
|
|
|
|
*/
|
2018-05-08 14:18:57 -07:00
|
|
|
static void
|
2018-07-03 17:22:34 -07:00
|
|
|
dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
|
2017-11-27 15:13:56 -08:00
|
|
|
{
|
2018-05-08 14:18:57 -07:00
|
|
|
int cpu;
|
2017-11-27 15:13:56 -08:00
|
|
|
int i;
|
|
|
|
struct list_head *lhp;
|
2018-05-08 14:18:57 -07:00
|
|
|
bool onl;
|
|
|
|
struct rcu_data *rdp;
|
2018-05-08 12:50:14 -07:00
|
|
|
struct rcu_node *rnp1;
|
2017-11-27 15:13:56 -08:00
|
|
|
|
2018-03-09 09:32:18 +08:00
|
|
|
raw_lockdep_assert_held_rcu_node(rnp);
|
2018-05-08 12:50:14 -07:00
|
|
|
pr_info("%s: grp: %d-%d level: %d ->gp_seq %ld ->completedqs %ld\n",
|
2018-05-01 15:00:10 -07:00
|
|
|
__func__, rnp->grplo, rnp->grphi, rnp->level,
|
2020-01-04 11:33:17 -08:00
|
|
|
(long)READ_ONCE(rnp->gp_seq), (long)rnp->completedqs);
|
2018-05-08 12:50:14 -07:00
|
|
|
for (rnp1 = rnp; rnp1; rnp1 = rnp1->parent)
|
|
|
|
pr_info("%s: %d:%d ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx\n",
|
|
|
|
__func__, rnp1->grplo, rnp1->grphi, rnp1->qsmask, rnp1->qsmaskinit, rnp1->qsmaskinitnext);
|
2018-05-01 15:00:10 -07:00
|
|
|
pr_info("%s: ->gp_tasks %p ->boost_tasks %p ->exp_tasks %p\n",
|
2020-01-03 15:22:01 -08:00
|
|
|
__func__, READ_ONCE(rnp->gp_tasks), data_race(rnp->boost_tasks),
|
2020-01-03 14:18:12 -08:00
|
|
|
READ_ONCE(rnp->exp_tasks));
|
2018-05-01 15:00:10 -07:00
|
|
|
pr_info("%s: ->blkd_tasks", __func__);
|
2017-11-27 15:13:56 -08:00
|
|
|
i = 0;
|
|
|
|
list_for_each(lhp, &rnp->blkd_tasks) {
|
|
|
|
pr_cont(" %p", lhp);
|
2019-03-29 15:25:52 +05:30
|
|
|
if (++i >= ncheck)
|
2017-11-27 15:13:56 -08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
pr_cont("\n");
|
2018-05-08 14:18:57 -07:00
|
|
|
for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
|
2018-07-03 15:37:16 -07:00
|
|
|
rdp = per_cpu_ptr(&rcu_data, cpu);
|
2018-05-08 14:18:57 -07:00
|
|
|
onl = !!(rdp->grpmask & rcu_rnp_online_cpus(rnp));
|
|
|
|
pr_info("\t%d: %c online: %ld(%d) offline: %ld(%d)\n",
|
|
|
|
cpu, ".o"[onl],
|
|
|
|
(long)rdp->rcu_onl_gp_seq, rdp->rcu_onl_gp_flags,
|
|
|
|
(long)rdp->rcu_ofl_gp_seq, rdp->rcu_ofl_gp_flags);
|
|
|
|
}
|
2017-11-27 15:13:56 -08:00
|
|
|
}
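The per-CPU line printed by dump_blkd_tasks() uses the idiom ".o"[onl], which indexes a two-character string literal with a bool. A small sketch of that idiom in isolation follows; online_marker() is a hypothetical helper, not a kernel function, and its two mask parameters stand in for rdp->grpmask and the value returned by rcu_rnp_online_cpus().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: map an online test to the marker character printed. */
static char online_marker(unsigned long grpmask, unsigned long online_mask)
{
	bool onl = !!(grpmask & online_mask);	/* collapse any nonzero bit to 1 */

	return ".o"[onl];	/* index 0 is '.', index 1 is 'o' */
}
```

The !! matters: without it, a mask bit other than bit 0 would index past the end of the literal.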
|
|
|
|
|
2014-09-22 14:00:48 -04:00
|
|
|
#else /* #ifdef CONFIG_PREEMPT_RCU */
|
2009-08-22 13:56:52 -07:00
|
|
|
|
2020-08-10 09:58:03 -07:00
|
|
|
/*
|
|
|
|
* If strict grace periods are enabled, and if the calling
|
|
|
|
* __rcu_read_unlock() marks the beginning of a quiescent state, immediately
|
|
|
|
* report that quiescent state and, if requested, spin for a bit.
|
|
|
|
*/
|
|
|
|
void rcu_read_unlock_strict(void)
|
|
|
|
{
|
|
|
|
struct rcu_data *rdp;
|
|
|
|
|
|
|
|
if (!IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ||
|
|
|
|
irqs_disabled() || preempt_count() || !rcu_state.gp_kthread)
|
|
|
|
return;
|
|
|
|
rdp = this_cpu_ptr(&rcu_data);
|
2020-08-20 11:26:14 -07:00
|
|
|
rcu_report_qs_rdp(rdp);
|
2020-08-10 09:58:03 -07:00
|
|
|
udelay(rcu_unlock_delay);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rcu_read_unlock_strict);
|
|
|
|
|
2009-08-22 13:56:52 -07:00
|
|
|
/*
|
|
|
|
* Tell them what RCU they are running.
|
|
|
|
*/
|
2009-11-11 11:28:06 -08:00
|
|
|
static void __init rcu_bootup_announce(void)
|
2009-08-22 13:56:52 -07:00
|
|
|
{
|
2013-03-18 16:24:11 -07:00
|
|
|
pr_info("Hierarchical RCU implementation.\n");
|
2010-04-13 14:19:23 -07:00
|
|
|
rcu_bootup_announce_oddness();
|
2009-08-22 13:56:52 -07:00
|
|
|
}
|
|
|
|
|
2018-07-02 14:30:37 -07:00
|
|
|
/*
|
2019-10-15 21:18:14 +02:00
|
|
|
* Note a quiescent state for PREEMPTION=n. Because we do not need to know
|
2018-07-02 14:30:37 -07:00
|
|
|
* how many quiescent states passed, just if there was at least one since
|
|
|
|
* the start of the grace period, this just sets a flag. The caller must
|
|
|
|
* have disabled preemption.
|
|
|
|
*/
|
|
|
|
static void rcu_qs(void)
|
2018-06-28 14:45:25 -07:00
|
|
|
{
|
2018-07-02 14:30:37 -07:00
|
|
|
RCU_LOCKDEP_WARN(preemptible(), "rcu_qs() invoked with preemption enabled!!!");
|
|
|
|
if (!__this_cpu_read(rcu_data.cpu_no_qs.s))
|
|
|
|
return;
|
|
|
|
trace_rcu_grace_period(TPS("rcu_sched"),
|
|
|
|
__this_cpu_read(rcu_data.gp_seq), TPS("cpuqs"));
|
|
|
|
__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
|
|
|
|
if (!__this_cpu_read(rcu_data.cpu_no_qs.b.exp))
|
|
|
|
return;
|
|
|
|
__this_cpu_write(rcu_data.cpu_no_qs.b.exp, false);
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_report_exp_rdp(this_cpu_ptr(&rcu_data));
|
2018-06-28 14:45:25 -07:00
|
|
|
}
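rcu_qs() tests cpu_no_qs.s once and then clears cpu_no_qs.b.norm and cpu_no_qs.b.exp separately. That works if the two one-byte flags overlay a single wider field, so one read covers both. The sketch below assumes that layout: union model_noqs and model_rcu_qs() are hypothetical user-space names, and the return value stands in for the call to rcu_report_exp_rdp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: two byte-sized flags overlaid on one 16-bit word. */
union model_noqs {
	struct {
		uint8_t norm;	/* normal grace period still needs a QS */
		uint8_t exp;	/* expedited grace period still needs a QS */
	} b;
	uint16_t s;		/* both flags, readable in a single access */
};

/* Returns true when an expedited quiescent state would be reported. */
static bool model_rcu_qs(union model_noqs *q)
{
	if (!q->s)
		return false;	/* neither flag set: nothing to record */
	q->b.norm = 0;
	if (!q->b.exp)
		return false;
	q->b.exp = 0;
	return true;		/* caller would report the expedited QS */
}
```

The single read of .s keeps the common case, where no quiescent state is needed, down to one load and one branch.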
|
|
|
|
|
2018-07-10 14:00:14 -07:00
|
|
|
/*
|
|
|
|
* Register an urgently needed quiescent state. If there is an
|
|
|
|
* emergency, invoke rcu_momentary_dyntick_idle() to do a heavy-weight
|
|
|
|
* dyntick-idle quiescent state visible to other CPUs, which will in
|
|
|
|
* some cases serve for expedited as well as normal grace periods.
|
|
|
|
* Either way, register a lightweight quiescent state.
|
|
|
|
*/
|
|
|
|
void rcu_all_qs(void)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
2018-08-03 21:00:38 -07:00
|
|
|
if (!raw_cpu_read(rcu_data.rcu_urgent_qs))
|
2018-07-10 14:00:14 -07:00
|
|
|
return;
|
|
|
|
preempt_disable();
|
|
|
|
/* Load rcu_urgent_qs before other flags. */
|
2018-08-03 21:00:38 -07:00
|
|
|
if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
|
2018-07-10 14:00:14 -07:00
|
|
|
preempt_enable();
|
|
|
|
return;
|
|
|
|
}
|
2018-08-03 21:00:38 -07:00
|
|
|
this_cpu_write(rcu_data.rcu_urgent_qs, false);
|
|
|
|
if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
|
2018-07-10 14:00:14 -07:00
|
|
|
local_irq_save(flags);
|
|
|
|
rcu_momentary_dyntick_idle();
|
|
|
|
local_irq_restore(flags);
|
|
|
|
}
|
2018-07-11 08:09:28 -07:00
|
|
|
rcu_qs();
|
2018-07-10 14:00:14 -07:00
|
|
|
preempt_enable();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rcu_all_qs);
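rcu_all_qs() checks rcu_urgent_qs twice: a cheap unordered read for the fast path, then an acquire load before the other flags are examined. The following is a rough user-space sketch of that pattern using C11 atomics. struct model_cpu and model_all_qs() are hypothetical, the two counters only record which kind of quiescent state would have been registered, and preemption and interrupt disabling are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU fields that rcu_all_qs() touches. */
struct model_cpu {
	atomic_bool urgent_qs;	/* models rcu_data.rcu_urgent_qs */
	bool need_heavy_qs;	/* models rcu_data.rcu_need_heavy_qs */
	int light_qs;		/* counts lightweight quiescent states */
	int heavy_qs;		/* counts heavyweight quiescent states */
};

static void model_all_qs(struct model_cpu *c)
{
	/* Fast path: unordered peek, so the common case stays cheap. */
	if (!atomic_load_explicit(&c->urgent_qs, memory_order_relaxed))
		return;
	/* Recheck with acquire ordering before looking at other flags. */
	if (!atomic_load_explicit(&c->urgent_qs, memory_order_acquire))
		return;
	atomic_store_explicit(&c->urgent_qs, false, memory_order_relaxed);
	if (c->need_heavy_qs)
		c->heavy_qs++;	/* stands in for rcu_momentary_dyntick_idle() */
	c->light_qs++;		/* stands in for rcu_qs() */
}
```

Once the urgent flag has been consumed, later calls return on the fast path until some other party sets it again.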
|
|
|
|
|
2012-07-02 07:08:42 -07:00
|
|
|
/*
|
2019-10-15 21:18:14 +02:00
|
|
|
* Note a PREEMPTION=n context switch. The caller must have disabled interrupts.
|
2012-07-02 07:08:42 -07:00
|
|
|
*/
|
2018-07-02 14:30:37 -07:00
|
|
|
void rcu_note_context_switch(bool preempt)
|
2012-07-02 07:08:42 -07:00
|
|
|
{
|
2018-07-02 14:30:37 -07:00
|
|
|
trace_rcu_utilization(TPS("Start context switch"));
|
|
|
|
rcu_qs();
|
|
|
|
/* Load rcu_urgent_qs before other flags. */
|
2018-08-03 21:00:38 -07:00
|
|
|
if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
|
2018-07-02 14:30:37 -07:00
|
|
|
goto out;
|
2018-08-03 21:00:38 -07:00
|
|
|
this_cpu_write(rcu_data.rcu_urgent_qs, false);
|
|
|
|
if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs)))
|
2018-07-02 14:30:37 -07:00
|
|
|
rcu_momentary_dyntick_idle();
|
2020-03-16 20:38:29 -07:00
|
|
|
rcu_tasks_qs(current, preempt);
|
2018-07-02 14:30:37 -07:00
|
|
|
out:
|
|
|
|
trace_rcu_utilization(TPS("End context switch"));
|
2012-07-02 07:08:42 -07:00
|
|
|
}
|
2018-07-02 14:30:37 -07:00
|
|
|
EXPORT_SYMBOL_GPL(rcu_note_context_switch);
|
2012-07-02 07:08:42 -07:00
|
|
|
|
2009-09-23 09:50:41 -07:00
|
|
|
/*
|
2011-03-02 13:15:15 -08:00
|
|
|
* Because preemptible RCU does not exist, there are never any preempted
|
2009-09-23 09:50:41 -07:00
|
|
|
* RCU readers.
|
|
|
|
*/
|
2011-02-07 12:47:15 -08:00
|
|
|
static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
|
2009-09-23 09:50:41 -07:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-10-31 11:22:37 -07:00
|
|
|
/*
|
|
|
|
* Because there is no preemptible RCU, there can be no readers blocked.
|
|
|
|
*/
|
|
|
|
static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
|
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
When the last CPU of a given leaf rcu_node structure goes
offline, all of the tasks queued on that leaf rcu_node structure
(due to having blocked in their current RCU read-side critical
sections) are requeued onto the root rcu_node structure. This
requeuing is carried out by rcu_preempt_offline_tasks().
However, it is possible that these queued tasks are the only
thing preventing the leaf rcu_node structure from reporting a
quiescent state up the rcu_node hierarchy. Unfortunately, the
old code would fail to do this reporting, resulting in a
grace-period stall given the following sequence of events:
1. Kernel built for more than 32 CPUs on 32-bit systems or for more
than 64 CPUs on 64-bit systems, so that there is more than one
rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set
to a number smaller than CONFIG_NR_CPUS.)
2. The kernel is built with CONFIG_TREE_PREEMPT_RCU.
3. A task running on a CPU associated with a given leaf rcu_node
structure blocks while in an RCU read-side critical section
-and- that CPU has not yet passed through a quiescent state
for the current RCU grace period. This will cause the task
to be queued on the leaf rcu_node's blocked_tasks[] array, in
particular, on the element of this array corresponding to the
current grace period.
4. Each of the remaining CPUs corresponding to this same leaf rcu_node
structure passes through a quiescent state. However, the task is
still in its RCU read-side critical section, so these quiescent
states cannot be reported further up the rcu_node hierarchy.
Nevertheless, all bits in the leaf rcu_node structure's ->qsmask
field are now zero.
5. Each of the remaining CPUs goes offline. (The events in steps
#4 and #5 can happen in any order as long as each CPU passes
through a quiescent state before going offline.)
6. When the last CPU goes offline, __rcu_offline_cpu() will invoke
rcu_preempt_offline_tasks(), which will move the task to the
root rcu_node structure, but without reporting a quiescent state
up the rcu_node hierarchy (and this failure to report a quiescent
state is the bug).
But because this leaf rcu_node structure's ->qsmask field is
already zero and its ->blocked_tasks[] entries are all empty,
force_quiescent_state() will skip this rcu_node structure.
Therefore, grace periods are now hung.
This patch abstracts some code out of rcu_read_unlock_special(),
calling the result task_quiet() by analogy with cpu_quiet(), and
invokes task_quiet() from both rcu_read_unlock_special() and
__rcu_offline_cpu(). Invoking task_quiet() from
__rcu_offline_cpu() reports the quiescent state up the rcu_node
hierarchy, fixing the bug. This ends up requiring a separate
lock_class_key per level of the rcu_node hierarchy, which this
patch also provides.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12589088301770-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-22 08:53:48 -08:00
|
|
|
{
|
2014-10-31 11:22:37 -07:00
|
|
|
return false;
|
2009-11-22 08:53:48 -08:00
|
|
|
}
|
|
|
|
|
2018-06-21 12:50:01 -07:00
|
|
|
/*
|
|
|
|
* Because there is no preemptible RCU, there can be no deferred quiescent
|
|
|
|
* states.
|
|
|
|
*/
|
|
|
|
static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
static void rcu_preempt_deferred_qs(struct task_struct *t) { }
|
|
|
|
|
2009-09-13 09:15:09 -07:00
|
|
|
/*
|
2011-03-02 13:15:15 -08:00
|
|
|
* Because there is no preemptible RCU, there can be no readers blocked,
|
2009-09-18 09:50:19 -07:00
|
|
|
* so there is no need to check for blocked tasks. So check only for
|
|
|
|
* bogus qsmask values.
|
2009-09-13 09:15:09 -07:00
|
|
|
*/
|
2018-07-03 17:22:34 -07:00
|
|
|
static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
|
2009-09-13 09:15:09 -07:00
|
|
|
{
|
2009-09-18 09:50:19 -07:00
|
|
|
WARN_ON_ONCE(rnp->qsmask);
|
2009-09-13 09:15:09 -07:00
|
|
|
}
|
|
|
|
|
2009-08-22 13:56:52 -07:00
|
|
|
/*
|
2018-11-21 11:35:03 -08:00
|
|
|
* Check to see if this CPU is in a non-context-switch quiescent state,
|
|
|
|
* namely user mode and idle loop.
|
2009-08-22 13:56:52 -07:00
|
|
|
*/
|
2018-11-21 11:35:03 -08:00
|
|
|
static void rcu_flavor_sched_clock_irq(int user)
|
2009-08-22 13:56:52 -07:00
|
|
|
{
|
2018-07-02 14:30:37 -07:00
|
|
|
if (user || rcu_is_cpu_rrupt_from_idle()) {
|
2009-08-22 13:56:52 -07:00
|
|
|
|
2018-07-02 14:30:37 -07:00
|
|
|
/*
|
|
|
|
* Get here if this CPU took its interrupt from user
|
|
|
|
* mode or from the idle loop, and if this is not a
|
|
|
|
* nested interrupt. In this case, the CPU is in
|
|
|
|
* a quiescent state, so note it.
|
|
|
|
*
|
|
|
|
* No memory barrier is required here because rcu_qs()
|
|
|
|
* references only CPU-local variables that other CPUs
|
|
|
|
* neither access nor modify, at least not while the
|
|
|
|
* corresponding CPU is online.
|
|
|
|
*/
|
|
|
|
|
|
|
|
rcu_qs();
|
|
|
|
}
|
2009-10-06 21:48:17 -07:00
|
|
|
}
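The tick-time check in rcu_flavor_sched_clock_irq() can be modeled in user space roughly as follows. This is only a sketch: `struct cpu_model`, its fields, and the `model_*` functions are hypothetical stand-ins rather than kernel names, and the real rcu_is_cpu_rrupt_from_idle() inspects RCU's own per-CPU nesting state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU model; none of these fields are kernel names. */
struct cpu_model {
	bool from_idle;   /* interrupt arrived while in the idle loop */
	int irq_nesting;  /* 1 means a first-level (non-nested) interrupt */
	bool qs_noted;    /* set once a quiescent state has been recorded */
};

/* Stand-in for rcu_is_cpu_rrupt_from_idle(): idle and not nested. */
bool model_irq_from_idle(const struct cpu_model *c)
{
	return c->from_idle && c->irq_nesting == 1;
}

/* Stand-in for rcu_flavor_sched_clock_irq(): note a QS if user or idle. */
void model_sched_clock_irq(struct cpu_model *c, bool user)
{
	if (user || model_irq_from_idle(c))
		c->qs_noted = true; /* models the call to rcu_qs() */
}
```

A nested interrupt taken from idle does not count, which mirrors the "if this is not a nested interrupt" condition in the comment above.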
|
|
|
|
|
2013-04-11 10:15:52 -07:00
|
|
|
/*
|
|
|
|
* Because preemptible RCU does not exist, tasks cannot possibly exit
|
|
|
|
* while in preemptible RCU read-side critical sections.
|
|
|
|
*/
|
|
|
|
void exit_rcu(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2017-11-27 15:13:56 -08:00
|
|
|
/*
|
|
|
|
* Dump the guaranteed-empty blocked-tasks state. Trust but verify.
|
|
|
|
*/
|
2018-05-08 14:18:57 -07:00
|
|
|
static void
|
2018-07-03 17:22:34 -07:00
|
|
|
dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
|
2017-11-27 15:13:56 -08:00
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!list_empty(&rnp->blkd_tasks));
|
|
|
|
}
|
|
|
|
|
2014-09-22 14:00:48 -04:00
|
|
|
#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
|
2010-02-22 17:04:59 -08:00
|
|
|
|
2019-03-20 22:13:33 +01:00
|
|
|
/*
|
|
|
|
* If boosting, set rcuc kthreads to realtime priority.
|
|
|
|
*/
|
|
|
|
static void rcu_cpu_kthread_setup(unsigned int cpu)
|
|
|
|
{
|
2011-02-07 12:47:15 -08:00
|
|
|
#ifdef CONFIG_RCU_BOOST
|
2019-03-20 22:13:33 +01:00
|
|
|
struct sched_param sp;
|
2011-02-07 12:47:15 -08:00
|
|
|
|
2019-03-20 22:13:33 +01:00
|
|
|
sp.sched_priority = kthread_prio;
|
|
|
|
sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
|
|
|
|
#endif /* #ifdef CONFIG_RCU_BOOST */
|
2012-07-16 10:42:35 +00:00
|
|
|
}
|
|
|
|
|
2019-03-20 22:13:33 +01:00
|
|
|
#ifdef CONFIG_RCU_BOOST
|
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
/*
|
|
|
|
* Carry out RCU priority boosting on the task indicated by ->exp_tasks
|
|
|
|
* or ->boost_tasks, advancing the pointer to the next task in the
|
|
|
|
* ->blkd_tasks list.
|
|
|
|
*
|
|
|
|
* Note that irqs must be enabled: boosting the task can block.
|
|
|
|
* Returns 1 if there are more tasks needing to be boosted.
|
|
|
|
*/
|
|
|
|
static int rcu_boost(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
struct task_struct *t;
|
|
|
|
struct list_head *tb;
|
|
|
|
|
2015-03-03 14:57:58 -08:00
|
|
|
if (READ_ONCE(rnp->exp_tasks) == NULL &&
|
|
|
|
READ_ONCE(rnp->boost_tasks) == NULL)
|
2011-02-07 12:47:15 -08:00
|
|
|
return 0; /* Nothing left to boost. */
|
|
|
|
|
2015-10-08 12:24:23 +02:00
|
|
|
raw_spin_lock_irqsave_rcu_node(rnp, flags);
|
2011-02-07 12:47:15 -08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Recheck under the lock: all tasks in need of boosting
|
|
|
|
* might exit their RCU read-side critical sections on their own.
|
|
|
|
*/
|
|
|
|
if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL) {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-02-07 12:47:15 -08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Preferentially boost tasks blocking expedited grace periods.
|
|
|
|
* This cannot starve the normal grace periods because a second
|
|
|
|
* expedited grace period must boost all blocked tasks, including
|
|
|
|
* those blocking the pre-existing normal grace period.
|
|
|
|
*/
|
2018-01-10 12:16:42 -08:00
|
|
|
if (rnp->exp_tasks != NULL)
|
2011-02-07 12:47:15 -08:00
|
|
|
tb = rnp->exp_tasks;
|
2018-01-10 12:16:42 -08:00
|
|
|
else
|
2011-02-07 12:47:15 -08:00
|
|
|
tb = rnp->boost_tasks;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We boost task t by manufacturing an rt_mutex that appears to
|
|
|
|
* be held by task t. We leave a pointer to that rt_mutex where
|
|
|
|
* task t can find it, and task t will release the mutex when it
|
|
|
|
* exits its outermost RCU read-side critical section. Then
|
|
|
|
* simply acquiring this artificial rt_mutex will boost task
|
|
|
|
* t's priority. (Thanks to tglx for suggesting this approach!)
|
|
|
|
*
|
|
|
|
* Note that task t must acquire rnp->lock to remove itself from
|
|
|
|
* the ->blkd_tasks list, which it will do from exit() if from
|
|
|
|
* nowhere else. We therefore are guaranteed that task t will
|
|
|
|
* stay around at least until we drop rnp->lock. Note that
|
|
|
|
* rnp->lock also resolves races between our priority boosting
|
|
|
|
* and task t's exiting its outermost RCU read-side critical
|
|
|
|
* section.
|
|
|
|
*/
|
|
|
|
t = container_of(tb, struct task_struct, rcu_node_entry);
|
2014-06-12 13:30:25 -07:00
|
|
|
rt_mutex_init_proxy_locked(&rnp->boost_mtx, t);
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2014-06-12 13:30:25 -07:00
|
|
|
/* Lock only for side effect: boosts task t's priority. */
|
|
|
|
rt_mutex_lock(&rnp->boost_mtx);
|
|
|
|
rt_mutex_unlock(&rnp->boost_mtx); /* Then keep lockdep happy. */
|
2011-02-07 12:47:15 -08:00
|
|
|
|
2015-03-03 14:57:58 -08:00
|
|
|
return READ_ONCE(rnp->exp_tasks) != NULL ||
|
|
|
|
READ_ONCE(rnp->boost_tasks) != NULL;
|
2011-02-07 12:47:15 -08:00
|
|
|
}
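The target-selection step of rcu_boost() (prefer ->exp_tasks, fall back to ->boost_tasks, then recover the task from its list entry) can be illustrated with the user-space sketch below. The `model_*` names and both structs are assumptions made for illustration; the rt_mutex proxy-locking step and all locking are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(); the kernel's version adds type checking. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical task; `entry` plays the role of ->rcu_node_entry. */
struct task_model {
	int pid;
	struct list_node entry;
};

/* Hypothetical rcu_node carrying only the two list pointers of interest. */
struct node_model {
	struct list_node *exp_tasks;   /* blocks an expedited grace period */
	struct list_node *boost_tasks; /* blocks a normal grace period */
};

/* Return the task to boost, or NULL when nothing is left to boost. */
struct task_model *model_pick_boost_target(const struct node_model *n)
{
	struct list_node *tb;

	if (n->exp_tasks == NULL && n->boost_tasks == NULL)
		return NULL;
	/* Expedited grace periods take precedence. */
	tb = n->exp_tasks != NULL ? n->exp_tasks : n->boost_tasks;
	return model_container_of(tb, struct task_model, entry);
}
```

In the kernel this selection happens while holding rnp->lock, which is what keeps the chosen task from leaving the ->blkd_tasks list in the meantime.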
|
|
|
|
|
|
|
|
/*
|
2015-06-06 08:11:43 -07:00
|
|
|
* Priority-boosting kthread, one per leaf rcu_node.
|
2011-02-07 12:47:15 -08:00
|
|
|
*/
|
|
|
|
static int rcu_boost_kthread(void *arg)
|
|
|
|
{
|
|
|
|
struct rcu_node *rnp = (struct rcu_node *)arg;
|
|
|
|
int spincnt = 0;
|
|
|
|
int more2boost;
|
|
|
|
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_utilization(TPS("Start boost kthread@init"));
|
2011-02-07 12:47:15 -08:00
|
|
|
for (;;) {
|
2020-01-08 20:12:59 -08:00
|
|
|
WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_WAITING);
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_utilization(TPS("End boost kthread@rcu_wait"));
|
2020-01-03 15:22:01 -08:00
|
|
|
rcu_wait(READ_ONCE(rnp->boost_tasks) ||
|
|
|
|
READ_ONCE(rnp->exp_tasks));
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_utilization(TPS("Start boost kthread@rcu_wait"));
|
2020-01-08 20:12:59 -08:00
|
|
|
WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_RUNNING);
|
2011-02-07 12:47:15 -08:00
|
|
|
more2boost = rcu_boost(rnp);
|
|
|
|
if (more2boost)
|
|
|
|
spincnt++;
|
|
|
|
else
|
|
|
|
spincnt = 0;
|
|
|
|
if (spincnt > 10) {
|
2020-01-08 20:12:59 -08:00
|
|
|
WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_YIELDING);
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_utilization(TPS("End boost kthread@rcu_yield"));
|
2020-05-07 16:34:38 -07:00
|
|
|
schedule_timeout_idle(2);
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_utilization(TPS("Start boost kthread@rcu_yield"));
|
2011-02-07 12:47:15 -08:00
|
|
|
spincnt = 0;
|
|
|
|
}
|
|
|
|
}
|
2011-05-04 21:43:49 -07:00
|
|
|
/* NOTREACHED */
|
2013-07-12 17:18:47 -04:00
|
|
|
trace_rcu_utilization(TPS("End boost kthread@notreached"));
|
2011-02-07 12:47:15 -08:00
|
|
|
return 0;
|
|
|
|
}
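The back-off logic in the rcu_boost_kthread() loop can be reduced to the small state machine sketched below. `model_boost_step()` is a hypothetical helper, not a kernel function; it assumes the caller sleeps (as schedule_timeout_idle(2) does above) whenever it returns true.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Update the consecutive-work counter after one boost pass.
 * Returns true when the loop should yield the CPU for a while.
 */
bool model_boost_step(int *spincnt, bool more2boost)
{
	if (more2boost)
		(*spincnt)++;
	else
		*spincnt = 0;
	if (*spincnt > 10) {
		*spincnt = 0; /* the kthread resets the count after sleeping */
		return true;
	}
	return false;
}
```

With this rule, ten consecutive passes that report more work do not trigger a yield; the eleventh does, after which the count starts over.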
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check to see if it is time to start boosting RCU readers that are
|
|
|
|
* blocking the current grace period, and, if so, tell the per-rcu_node
|
|
|
|
* kthread to start boosting them. If there is an expedited grace
|
|
|
|
* period in progress, it is always time to boost.
|
|
|
|
*
|
2012-08-01 15:57:54 -07:00
|
|
|
* The caller must hold rnp->lock, which this function releases.
|
|
|
|
* The ->boost_kthread_task is immortal, so we don't need to worry
|
|
|
|
* about it going away.
|
2011-02-07 12:47:15 -08:00
|
|
|
*/
|
2011-05-04 21:43:49 -07:00
|
|
|
static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
|
2014-06-11 16:39:40 -04:00
|
|
|
__releases(rnp->lock)
|
2011-02-07 12:47:15 -08:00
|
|
|
{
|
2018-01-17 06:24:30 -08:00
|
|
|
raw_lockdep_assert_held_rcu_node(rnp);
|
2011-02-22 13:42:43 -08:00
|
|
|
if (!rcu_preempt_blocked_readers_cgp(rnp) && rnp->exp_tasks == NULL) {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-02-07 12:47:15 -08:00
|
|
|
return;
|
2011-02-22 13:42:43 -08:00
|
|
|
}
|
2011-02-07 12:47:15 -08:00
|
|
|
if (rnp->exp_tasks != NULL ||
|
|
|
|
(rnp->gp_tasks != NULL &&
|
|
|
|
rnp->boost_tasks == NULL &&
|
|
|
|
rnp->qsmask == 0 &&
|
2020-04-10 15:52:53 -07:00
|
|
|
(!time_after(rnp->boost_time, jiffies) || rcu_state.cbovld))) {
|
2011-02-07 12:47:15 -08:00
|
|
|
if (rnp->exp_tasks == NULL)
|
2020-01-04 10:44:41 -08:00
|
|
|
WRITE_ONCE(rnp->boost_tasks, rnp->gp_tasks);
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2019-03-21 16:29:50 -07:00
|
|
|
rcu_wake_cond(rnp->boost_kthread_task,
|
2020-01-08 20:12:59 -08:00
|
|
|
READ_ONCE(rnp->boost_kthread_status));
|
2011-05-04 21:43:49 -07:00
|
|
|
} else {
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-05-04 21:43:49 -07:00
|
|
|
}
|
2011-02-07 12:47:15 -08:00
|
|
|
}
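The main condition in rcu_initiate_boost() can be expressed as a pure predicate, sketched below under simplifying assumptions: list pointers are collapsed to booleans, locking and the early "no blocked readers" return are omitted, and all `model_*` names are hypothetical. The time comparison follows the wraparound-safe style of the kernel's time_after().

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is after b", in the style of time_after(). */
bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical snapshot of the rcu_node fields the condition reads. */
struct boost_model {
	bool exp_tasks;           /* ->exp_tasks != NULL */
	bool gp_tasks;            /* ->gp_tasks != NULL */
	bool boost_tasks;         /* ->boost_tasks != NULL */
	unsigned long qsmask;     /* CPUs still owing a quiescent state */
	unsigned long boost_time; /* jiffies value at which boosting may start */
};

/* True when the boost kthread should be awakened. */
bool model_should_boost(const struct boost_model *m, unsigned long now,
			bool cbovld)
{
	return m->exp_tasks ||
	       (m->gp_tasks && !m->boost_tasks && m->qsmask == 0 &&
		(!model_time_after(m->boost_time, now) || cbovld));
}
```

An expedited grace period always qualifies, whereas a normal one must have only blocked tasks left to wait for (qsmask == 0) and must either have reached its boost time or be under callback overload.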
|
|
|
|
|
2011-11-29 15:57:13 -08:00
|
|
|
/*
|
|
|
|
* Is the current CPU running the RCU-callbacks kthread?
|
|
|
|
* Caller must have preemption disabled.
|
|
|
|
*/
|
|
|
|
static bool rcu_is_callbacks_kthread(void)
|
|
|
|
{
|
2018-11-30 16:11:14 -08:00
|
|
|
return __this_cpu_read(rcu_data.rcu_cpu_kthread_task) == current;
|
2011-11-29 15:57:13 -08:00
|
|
|
}
|
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
#define RCU_BOOST_DELAY_JIFFIES DIV_ROUND_UP(CONFIG_RCU_BOOST_DELAY * HZ, 1000)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do priority-boost accounting for the start of a new grace period.
|
|
|
|
*/
|
|
|
|
static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
rnp->boost_time = jiffies + RCU_BOOST_DELAY_JIFFIES;
|
|
|
|
}
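The RCU_BOOST_DELAY_JIFFIES computation converts a delay in milliseconds to jiffies, rounding up so that a nonzero delay never becomes zero ticks. A minimal user-space sketch follows; `MODEL_DIV_ROUND_UP` and `model_boost_delay_jiffies()` are illustrative names, and the HZ values in the usage note are only examples.

```c
#include <assert.h>

/* Same arithmetic as the kernel's DIV_ROUND_UP() for positive operands. */
#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Convert a delay in milliseconds to jiffies at `hz` ticks per second. */
unsigned long model_boost_delay_jiffies(unsigned long delay_ms,
					unsigned long hz)
{
	return MODEL_DIV_ROUND_UP(delay_ms * hz, 1000UL);
}
```

For example, a 500 ms delay is 125 jiffies at 250 ticks per second, and a 1 ms delay at 100 ticks per second rounds up to a single jiffy.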
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create an RCU-boost kthread for the specified node if one does not
|
|
|
|
* already exist. We only create this kthread for preemptible RCU.
|
|
|
|
* On failure, warns once and returns without creating the kthread.
|
|
|
|
*/
|
2019-07-01 09:40:39 +09:00
|
|
|
static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
|
2011-02-07 12:47:15 -08:00
|
|
|
{
|
2018-07-03 17:22:34 -07:00
|
|
|
int rnp_index = rnp - rcu_get_root();
|
2011-02-07 12:47:15 -08:00
|
|
|
unsigned long flags;
|
|
|
|
struct sched_param sp;
|
|
|
|
struct task_struct *t;
|
|
|
|
|
2018-07-03 17:22:34 -07:00
|
|
|
if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
|
2019-07-01 09:40:39 +09:00
|
|
|
return;
|
2012-07-16 10:42:35 +00:00
|
|
|
|
rcu: Process offlining and onlining only at grace-period start
Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events. Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.
This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures. When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree. The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field. In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared. If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.
This of course means that RCU's notion of which CPUs are offline can be
out of date. This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started. In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-23 21:52:37 -08:00
|
|
|
if (!rcu_scheduler_fully_active || rcu_rnp_online_cpus(rnp) == 0)
|
2019-07-01 09:40:39 +09:00
|
|
|
return;
|
2012-07-16 10:42:35 +00:00
|
|
|
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_state.boost = 1;
|
2019-07-01 09:40:39 +09:00
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
if (rnp->boost_kthread_task != NULL)
|
2019-07-01 09:40:39 +09:00
|
|
|
return;
|
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
t = kthread_create(rcu_boost_kthread, (void *)rnp,
|
2011-08-19 11:39:11 -07:00
|
|
|
"rcub/%d", rnp_index);
|
2019-07-01 09:40:39 +09:00
|
|
|
if (WARN_ON_ONCE(IS_ERR(t)))
|
|
|
|
return;
|
|
|
|
|
2015-10-08 12:24:23 +02:00
|
|
|
raw_spin_lock_irqsave_rcu_node(rnp, flags);
|
2011-02-07 12:47:15 -08:00
|
|
|
rnp->boost_kthread_task = t;
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2014-09-12 21:21:09 -05:00
|
|
|
sp.sched_priority = kthread_prio;
|
2011-02-07 12:47:15 -08:00
|
|
|
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
|
2011-05-30 20:38:55 -07:00
|
|
|
wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */
|
2011-02-07 12:47:15 -08:00
|
|
|
}
|
|
|
|
|
2011-06-16 08:26:32 -07:00
|
|
|
/*
|
|
|
|
* Set the per-rcu_node kthread's affinity to cover all CPUs that are
|
|
|
|
* served by the rcu_node in question. The CPU hotplug lock is still
|
|
|
|
* held, so the value of rnp->qsmaskinit will be stable.
|
|
|
|
*
|
|
|
|
* We don't include outgoingcpu in the affinity set; use -1 if there is
|
|
|
|
* no outgoing CPU. If there are no CPUs left in the affinity set,
|
|
|
|
* this function allows the kthread to execute on any CPU.
|
|
|
|
*/
|
2012-07-16 10:42:35 +00:00
|
|
|
static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
|
2011-06-16 08:26:32 -07:00
|
|
|
{
|
2012-07-16 10:42:35 +00:00
|
|
|
struct task_struct *t = rnp->boost_kthread_task;
|
2015-01-23 21:52:37 -08:00
|
|
|
unsigned long mask = rcu_rnp_online_cpus(rnp);
|
2011-06-16 08:26:32 -07:00
|
|
|
cpumask_var_t cm;
|
|
|
|
int cpu;
|
|
|
|
|
2012-07-16 10:42:35 +00:00
|
|
|
if (!t)
|
2011-06-16 08:26:32 -07:00
|
|
|
return;
|
2012-07-16 10:42:35 +00:00
|
|
|
if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
|
2011-06-16 08:26:32 -07:00
|
|
|
return;
|
rcu: Correctly handle sparse possible cpus
In many cases in the RCU tree code, we iterate over the set of cpus for
a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
per-cpu data for each cpu in this range. However, if the set of possible
cpus is sparse, some cpus described in this range are not possible, and
thus no per-cpu region will have been allocated (or initialised) for
them by the generic percpu code.
Erroneous accesses to a per-cpu area for these !possible cpus may fault
or may hit other data depending on the address generated when the
erroneous per cpu offset is applied. In practice, both cases have been
observed on arm64 hardware (the former being silent, but detectable with
additional patches).
To avoid issues resulting from this, we must iterate over the set of
*possible* cpus for a given leaf node. This patch adds a new helper,
for_each_leaf_node_possible_cpu, to enable this. As iteration is often
intertwined with rcu_node local bitmask manipulation, a new
leaf_node_cpu_bit helper is added to make this simpler and more
consistent. The RCU tree code is made to use both of these where
appropriate.
Without this patch, running reboot at a shell can result in an oops
like:
[ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
[ 3369.083881] pgd = ffffffc3ecdda000
[ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
[ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 3369.101781] Modules linked in:
[ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G W 4.6.0+ #3
[ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
[ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
[ 3369.139479] pc : [<ffffff80081109a8>] lr : [<ffffff8008110924>] pstate: 200001c5
[ 3369.146860] sp : ffffffc3eb9435a0
[ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
[ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
[ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
[ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
[ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
[ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
[ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
[ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
[ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
[ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
[ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
[ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
[ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
[ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
[ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
...
[ 3369.972257] [<ffffff80081109a8>] sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.978685] [<ffffff80081128b4>] synchronize_rcu_expedited+0x64/0xa8
[ 3369.985026] [<ffffff80086b987c>] synchronize_net+0x24/0x30
[ 3369.990499] [<ffffff80086ddb54>] dev_deactivate_many+0x28c/0x298
[ 3369.996493] [<ffffff80086b6bb8>] __dev_close_many+0x60/0xd0
[ 3370.002052] [<ffffff80086b6d48>] __dev_close+0x28/0x40
[ 3370.007178] [<ffffff80086bf62c>] __dev_change_flags+0x8c/0x158
[ 3370.012999] [<ffffff80086bf718>] dev_change_flags+0x20/0x60
[ 3370.018558] [<ffffff80086cf7f0>] do_setlink+0x288/0x918
[ 3370.023771] [<ffffff80086d0798>] rtnl_newlink+0x398/0x6a8
[ 3370.029158] [<ffffff80086cee84>] rtnetlink_rcv_msg+0xe4/0x220
[ 3370.034891] [<ffffff80086e274c>] netlink_rcv_skb+0xc4/0xf8
[ 3370.040364] [<ffffff80086ced8c>] rtnetlink_rcv+0x2c/0x40
[ 3370.045663] [<ffffff80086e1fe8>] netlink_unicast+0x160/0x238
[ 3370.051309] [<ffffff80086e24b8>] netlink_sendmsg+0x2f0/0x358
[ 3370.056956] [<ffffff80086a0070>] sock_sendmsg+0x18/0x30
[ 3370.062168] [<ffffff80086a21cc>] ___sys_sendmsg+0x26c/0x280
[ 3370.067728] [<ffffff80086a30ac>] __sys_sendmsg+0x44/0x88
[ 3370.073027] [<ffffff80086a3100>] SyS_sendmsg+0x10/0x20
[ 3370.078153] [<ffffff8008085e70>] el0_svc_naked+0x24/0x28
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Dennis Chen <dennis.chen@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-06-03 15:20:04 +01:00
|
|
|
for_each_leaf_node_possible_cpu(rnp, cpu)
|
|
|
|
if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
|
|
|
|
cpu != outgoingcpu)
|
2011-06-16 08:26:32 -07:00
|
|
|
cpumask_set_cpu(cpu, cm);
|
2014-11-10 08:07:08 -08:00
|
|
|
if (cpumask_weight(cm) == 0)
|
2011-06-16 08:26:32 -07:00
|
|
|
cpumask_setall(cm);
|
2012-07-16 10:42:35 +00:00
|
|
|
set_cpus_allowed_ptr(t, cm);
|
2011-06-16 08:26:32 -07:00
|
|
|
free_cpumask_var(cm);
|
|
|
|
}
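The mask construction in rcu_boost_kthread_setaffinity() can be sketched with plain bitmasks as below. This is an illustration only: `struct leaf_model` and `model_boost_affinity()` are hypothetical, cpumask_var_t is replaced by an unsigned long, and the sketch assumes every CPU in grplo..grphi is possible, which the real code does not.

```c
#include <assert.h>

/*
 * Hypothetical leaf node covering CPUs grplo..grphi. Bit (cpu - grplo)
 * of `online` is set for each online CPU, mirroring how
 * leaf_node_cpu_bit() indexes a leaf's mask.
 */
struct leaf_model {
	int grplo;
	int grphi;
	unsigned long online;
};

/*
 * Build the affinity mask for the leaf's boost kthread, skipping
 * outgoingcpu (-1 for none). Falls back to all_cpus if the set is empty.
 */
unsigned long model_boost_affinity(const struct leaf_model *l,
				   int outgoingcpu, unsigned long all_cpus)
{
	unsigned long cm = 0;
	int cpu;

	for (cpu = l->grplo; cpu <= l->grphi; cpu++)
		if ((l->online & (1UL << (cpu - l->grplo))) &&
		    cpu != outgoingcpu)
			cm |= 1UL << cpu;
	return cm ? cm : all_cpus;
}
```

The fallback matters when the last online CPU of a leaf is the one going offline: rather than leaving the kthread with an empty affinity set, it is allowed to run anywhere.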
|
|
|
|
|
|
|
|
/*
|
2014-07-13 12:00:53 -07:00
|
|
|
* Spawn boost kthreads -- called as soon as the scheduler is running.
|
2011-06-16 08:26:32 -07:00
|
|
|
*/
|
2014-07-13 12:00:53 -07:00
|
|
|
static void __init rcu_spawn_boost_kthreads(void)
|
2011-06-16 08:26:32 -07:00
|
|
|
{
|
|
|
|
struct rcu_node *rnp;
|
|
|
|
|
2018-07-04 14:33:59 -07:00
|
|
|
rcu_for_each_leaf_node(rnp)
|
2019-07-01 09:40:39 +09:00
|
|
|
rcu_spawn_one_boost_kthread(rnp);
|
2011-06-16 08:26:32 -07:00
|
|
|
}
|
|
|
|
|
2013-06-19 14:52:21 -04:00
|
|
|
static void rcu_prepare_kthreads(int cpu)
|
2011-06-16 08:26:32 -07:00
|
|
|
{
|
2018-07-03 15:37:16 -07:00
|
|
|
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
|
2011-06-16 08:26:32 -07:00
|
|
|
struct rcu_node *rnp = rdp->mynode;
|
|
|
|
|
|
|
|
/* Fire up the incoming CPU's kthread and leaf rcu_node kthread. */
|
2012-07-16 10:42:38 +00:00
|
|
|
if (rcu_scheduler_fully_active)
|
2019-07-01 09:40:39 +09:00
|
|
|
rcu_spawn_one_boost_kthread(rnp);
|
2011-06-16 08:26:32 -07:00
|
|
|
}
|
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
#else /* #ifdef CONFIG_RCU_BOOST */
|
|
|
|
|
2011-05-04 21:43:49 -07:00
|
|
|
static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
|
2014-06-11 16:39:40 -04:00
|
|
|
__releases(rnp->lock)
|
2011-02-07 12:47:15 -08:00
|
|
|
{
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
|
2011-02-07 12:47:15 -08:00
|
|
|
}
|
|
|
|
|
2011-11-29 15:57:13 -08:00
|
|
|
static bool rcu_is_callbacks_kthread(void)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2012-07-16 10:42:35 +00:00
|
|
|
static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
|
2011-06-16 08:26:32 -07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2014-07-13 12:00:53 -07:00
|
|
|
static void __init rcu_spawn_boost_kthreads(void)
|
2011-07-10 15:57:35 -07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2013-06-19 14:52:21 -04:00
|
|
|
static void rcu_prepare_kthreads(int cpu)
|
2011-06-16 08:26:32 -07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-02-07 12:47:15 -08:00
|
|
|
#endif /* #else #ifdef CONFIG_RCU_BOOST */
|
|
|
|
|
2010-02-22 17:04:59 -08:00
|
|
|
#if !defined(CONFIG_RCU_FAST_NO_HZ)
|
|
|
|
|
|
|
|
/*
|
2019-08-12 10:28:08 -07:00
|
|
|
* Check to see if any future non-offloaded RCU-related work will need
|
|
|
|
* to be done by the current CPU, even if none need be done immediately,
|
|
|
|
* returning 1 if so. This function is part of the RCU implementation;
|
|
|
|
* it is -not- an exported member of the RCU API.
|
2010-02-22 17:04:59 -08:00
|
|
|
*
|
2018-07-07 18:12:26 -07:00
|
|
|
* Because we do not have RCU_FAST_NO_HZ, just check whether or not this
|
|
|
|
* CPU has RCU callbacks queued.
|
2010-02-22 17:04:59 -08:00
|
|
|
*/
|
2015-04-14 21:08:58 +00:00
|
|
|
int rcu_needs_cpu(u64 basemono, u64 *nextevt)
|
2010-02-22 17:04:59 -08:00
|
|
|
{
|
2015-04-14 21:08:58 +00:00
|
|
|
*nextevt = KTIME_MAX;
|
2019-08-12 10:28:08 -07:00
|
|
|
return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
|
2020-11-12 01:51:21 +01:00
|
|
|
!rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
|
rcu: Permit dyntick-idle with callbacks pending
The current implementation of RCU_FAST_NO_HZ prevents CPUs from entering
dyntick-idle state if they have RCU callbacks pending. Unfortunately,
this has the side-effect of often preventing them from entering this
state, especially if at least one other CPU is not in dyntick-idle state.
However, the resulting per-tick wakeup is wasteful in many cases: if the
CPU has already fully responded to the current RCU grace period, there
will be nothing for it to do until this grace period ends, which will
frequently take several jiffies.
This commit therefore permits a CPU that has done everything that the
current grace period has asked of it (rcu_pending() == 0) to enter dyntick-idle
state even if it still has RCU callbacks pending. However, such a CPU posts a timer to
wake it up several jiffies later (6 jiffies, based on experience with
grace-period lengths). This wakeup is required to handle situations
that can result in all CPUs being in dyntick-idle mode, thus failing
to ever complete the current grace period. If a CPU wakes up before
the timer goes off, then it cancels that timer, thus avoiding spurious
wakeups.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-11-28 12:28:34 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Because we do not have RCU_FAST_NO_HZ, don't bother cleaning up
|
|
|
|
* after it.
|
|
|
|
*/
|
2014-10-22 15:07:37 -07:00
|
|
|
static void rcu_cleanup_after_idle(void)
|
2011-11-28 12:28:34 -08:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2011-11-02 06:54:54 -07:00
|
|
|
/*
|
2012-01-16 13:29:10 -08:00
|
|
|
* Do the idle-entry grace-period work, which, because CONFIG_RCU_FAST_NO_HZ=n,
|
2011-11-02 06:54:54 -07:00
|
|
|
* is nothing.
|
|
|
|
*/
|
2014-10-22 15:03:43 -07:00
|
|
|
static void rcu_prepare_for_idle(void)
|
2011-11-02 06:54:54 -07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2010-02-22 17:04:59 -08:00
|
|
|
#else /* #if !defined(CONFIG_RCU_FAST_NO_HZ) */
|
|
|
|
|
2011-11-30 15:41:14 -08:00
|
|
|
/*
|
|
|
|
* This code is invoked when a CPU goes idle, at which point we want
|
|
|
|
* to have the CPU do everything required for RCU so that it can enter
|
2019-08-30 12:36:32 -04:00
|
|
|
* the energy-efficient dyntick-idle mode.
|
2011-11-30 15:41:14 -08:00
|
|
|
*
|
2019-08-30 12:36:32 -04:00
|
|
|
* The following preprocessor symbol controls this:
|
2011-11-30 15:41:14 -08:00
|
|
|
*
|
|
|
|
* RCU_IDLE_GP_DELAY gives the number of jiffies that a CPU is permitted
|
|
|
|
* to sleep in dyntick-idle mode with RCU callbacks pending. This
|
|
|
|
* is sized to be roughly one RCU grace period. Those energy-efficiency
|
|
|
|
* benchmarkers who might otherwise be tempted to set this to a large
|
|
|
|
* number, be warned: Setting RCU_IDLE_GP_DELAY too high can hang your
|
|
|
|
* system. And if you are -that- concerned about energy efficiency,
|
|
|
|
* just power the system down and be done with it!
|
|
|
|
*
|
2019-08-30 12:36:32 -04:00
|
|
|
* The value below works well in practice. If future workloads require
|
2011-11-30 15:41:14 -08:00
|
|
|
* adjustment, they can be converted into kernel config parameters, though
|
|
|
|
* making the state machine smarter might be a better option.
|
|
|
|
*/
|
2012-06-04 20:45:10 -07:00
|
|
|
#define RCU_IDLE_GP_DELAY 4 /* Roughly one grace period. */
|
2011-11-30 15:41:14 -08:00
|
|
|
|
2012-12-12 12:35:29 -08:00
|
|
|
static int rcu_idle_gp_delay = RCU_IDLE_GP_DELAY;
|
|
|
|
module_param(rcu_idle_gp_delay, int, 0644);
|
2012-01-06 14:11:30 -08:00
|
|
|
|
|
|
|
/*
|
2018-07-07 18:12:26 -07:00
|
|
|
* Try to advance callbacks on the current CPU, but only if it has been
|
|
|
|
* a while since the last time we did so. Afterwards, if there are any
|
|
|
|
* callbacks ready for immediate invocation, return true.
|
2012-01-06 14:11:30 -08:00
|
|
|
*/
|
2013-11-17 21:08:07 -08:00
|
|
|
static bool __maybe_unused rcu_try_advance_all_cbs(void)
|
2012-01-06 14:11:30 -08:00
|
|
|
{
|
2012-12-28 11:30:36 -08:00
|
|
|
bool cbs_ready = false;
|
2018-08-03 21:00:38 -07:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
2012-12-28 11:30:36 -08:00
|
|
|
struct rcu_node *rnp;
|
2012-01-06 14:11:30 -08:00
|
|
|
|
2013-08-25 21:20:47 -07:00
|
|
|
/* Exit early if we advanced recently. */
|
2018-08-03 21:00:38 -07:00
|
|
|
if (jiffies == rdp->last_advance_all)
|
2014-07-08 18:26:13 -04:00
|
|
|
return false;
|
2018-08-03 21:00:38 -07:00
|
|
|
rdp->last_advance_all = jiffies;
|
2013-08-25 21:20:47 -07:00
|
|
|
|
2018-07-04 15:35:00 -07:00
|
|
|
rnp = rdp->mynode;
|
2012-01-06 14:11:30 -08:00
|
|
|
|
2018-07-04 15:35:00 -07:00
|
|
|
/*
|
|
|
|
* Don't bother checking unless a grace period has
|
|
|
|
* completed since we last checked and there are
|
|
|
|
* callbacks not yet ready to invoke.
|
|
|
|
*/
|
|
|
|
if ((rcu_seq_completed_gp(rdp->gp_seq,
|
|
|
|
rcu_seq_current(&rnp->gp_seq)) ||
|
|
|
|
unlikely(READ_ONCE(rdp->gpwrap))) &&
|
|
|
|
rcu_segcblist_pend_cbs(&rdp->cblist))
|
|
|
|
note_gp_changes(rdp);
|
|
|
|
|
|
|
|
if (rcu_segcblist_ready_cbs(&rdp->cblist))
|
|
|
|
cbs_ready = true;
|
2012-12-28 11:30:36 -08:00
|
|
|
return cbs_ready;
|
2012-01-06 14:11:30 -08:00
|
|
|
}
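The early exit in rcu_try_advance_all_cbs() is a once-per-jiffy rate limit: the function records the jiffy of its last attempt and refuses to repeat the work within the same jiffy. The following is a minimal userspace sketch of that pattern only. The struct, its fields other than last_advance_all, and the function name are hypothetical, and the return value here reports whether an attempt was made rather than whether callbacks are ready.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state consulted above. */
struct advance_state {
	unsigned long last_advance_all;	/* Tick of the most recent attempt. */
	int attempts;			/* Number of attempts actually made. */
};

/*
 * Return true if an advance attempt was made on this tick, false if
 * one had already been made during the same tick.
 */
static bool try_advance_once_per_tick(struct advance_state *st,
				      unsigned long now)
{
	assert(st);
	if (now == st->last_advance_all)
		return false;	/* Already tried during this tick. */
	st->last_advance_all = now;
	st->attempts++;		/* Placeholder for the real advancing work. */
	return true;
}
```

Calling the helper twice with the same tick value performs the work once; the second call returns false without touching the counter.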
|
|
|
|
|
rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() to see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s timers have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-10 16:41:44 -07:00
|
|
|
/*
|
2012-12-28 11:30:36 -08:00
|
|
|
* Allow the CPU to enter dyntick-idle mode unless it has callbacks ready
|
|
|
|
* to invoke. If the CPU has callbacks, try to advance them. Tell the
|
2019-08-30 12:36:32 -04:00
|
|
|
* caller what to set the timeout to.
|
2012-05-10 16:41:44 -07:00
|
|
|
*
|
2012-12-28 11:30:36 -08:00
|
|
|
* The caller must have disabled interrupts.
|
2012-05-10 16:41:44 -07:00
|
|
|
*/
|
2015-04-14 21:08:58 +00:00
|
|
|
int rcu_needs_cpu(u64 basemono, u64 *nextevt)
|
2012-05-10 16:41:44 -07:00
|
|
|
{
|
2018-08-03 21:00:38 -07:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
2015-04-14 21:08:58 +00:00
|
|
|
unsigned long dj;
|
2012-05-10 16:41:44 -07:00
|
|
|
|
2017-11-06 16:01:30 +01:00
|
|
|
lockdep_assert_irqs_disabled();
|
2015-03-04 15:41:24 -08:00
|
|
|
|
2019-08-12 10:28:08 -07:00
|
|
|
/* If no non-offloaded callbacks, RCU doesn't need the CPU. */
|
|
|
|
if (rcu_segcblist_empty(&rdp->cblist) ||
|
2020-11-12 01:51:21 +01:00
|
|
|
rcu_rdp_is_offloaded(rdp)) {
|
2015-04-14 21:08:58 +00:00
|
|
|
*nextevt = KTIME_MAX;
|
2012-05-10 16:41:44 -07:00
|
|
|
return 0;
|
|
|
|
}
|
2012-12-28 11:30:36 -08:00
|
|
|
|
|
|
|
/* Attempt to advance callbacks. */
|
|
|
|
if (rcu_try_advance_all_cbs()) {
|
|
|
|
/* Some ready to invoke, so initiate later invocation. */
|
|
|
|
invoke_rcu_core();
|
2012-05-10 16:41:44 -07:00
|
|
|
return 1;
|
|
|
|
}
|
2018-08-03 21:00:38 -07:00
|
|
|
rdp->last_accelerate = jiffies;
|
2012-12-28 11:30:36 -08:00
|
|
|
|
2019-08-30 12:36:32 -04:00
|
|
|
/* Request timer and round. */
|
|
|
|
dj = round_up(rcu_idle_gp_delay + jiffies, rcu_idle_gp_delay) - jiffies;
|
|
|
|
|
2015-04-14 21:08:58 +00:00
|
|
|
*nextevt = basemono + dj * TICK_NSEC;
|
2012-05-10 16:41:44 -07:00
|
|
|
return 0;
|
|
|
|
}
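The "Request timer and round" step above computes how many jiffies to sleep so that the wakeup lands on a multiple of rcu_idle_gp_delay, then converts that count to nanoseconds relative to basemono. A minimal userspace sketch of the arithmetic follows. It assumes the delay is a power of two (as the default of 4 is); round_up_pow2(), idle_delay_ticks(), next_event_ns(), and the tick_nsec parameter are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>

/* Userspace stand-in for a power-of-two round-up. */
static unsigned long round_up_pow2(unsigned long x, unsigned long y)
{
	assert(y != 0 && (y & (y - 1)) == 0);	/* y must be a power of two. */
	return (x + y - 1) & ~(y - 1);
}

/* Ticks to sleep so that the wakeup lands on a multiple of delay. */
static unsigned long idle_delay_ticks(unsigned long now, unsigned long delay)
{
	return round_up_pow2(delay + now, delay) - now;
}

/* Absolute wakeup time in nanoseconds, given the tick length. */
static unsigned long long next_event_ns(unsigned long long basemono,
					unsigned long dj,
					unsigned long long tick_nsec)
{
	return basemono + dj * tick_nsec;
}
```

Under the power-of-two assumption, the computed delay always falls between delay and 2 * delay - 1 ticks: with a delay of 4, a CPU going idle at tick 1000 sleeps 4 ticks, and one going idle at tick 1001 sleeps 7.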
|
|
|
|
|
2012-04-30 14:16:19 -07:00
|
|
|
/*
|
2019-08-30 12:36:32 -04:00
|
|
|
* Prepare a CPU for idle from an RCU perspective. The first major task is to
|
|
|
|
* sense whether nohz mode has been enabled or disabled via sysfs. The second
|
|
|
|
* major task is to accelerate (that is, assign grace-period numbers to) any
|
|
|
|
* recently arrived callbacks.
|
2011-11-02 06:54:54 -07:00
|
|
|
*
|
|
|
|
* The caller must have disabled interrupts.
|
2010-02-22 17:04:59 -08:00
|
|
|
*/
|
2014-10-22 15:03:43 -07:00
|
|
|
static void rcu_prepare_for_idle(void)
|
2010-02-22 17:04:59 -08:00
|
|
|
{
|
rcu: Make callers awaken grace-period kthread
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.
Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.
However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely deferring a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.
In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.
This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-03-11 13:02:16 -07:00
|
|
|
bool needwake;
|
2018-08-03 21:00:38 -07:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
2012-12-28 11:30:36 -08:00
|
|
|
struct rcu_node *rnp;
|
2012-06-24 10:15:02 -07:00
|
|
|
int tne;
|
|
|
|
|
2017-11-06 16:01:30 +01:00
|
|
|
lockdep_assert_irqs_disabled();
|
2020-11-12 01:51:21 +01:00
|
|
|
if (rcu_rdp_is_offloaded(rdp))
|
2015-03-04 15:41:24 -08:00
|
|
|
return;
|
|
|
|
|
2012-06-24 10:15:02 -07:00
|
|
|
/* Handle nohz enablement switches conservatively. */
|
2015-03-03 14:57:58 -08:00
|
|
|
tne = READ_ONCE(tick_nohz_active);
|
2018-08-03 21:00:38 -07:00
|
|
|
if (tne != rdp->tick_nohz_enabled_snap) {
|
2018-11-29 13:28:49 -08:00
|
|
|
if (!rcu_segcblist_empty(&rdp->cblist))
|
2012-06-24 10:15:02 -07:00
|
|
|
invoke_rcu_core(); /* force nohz to see update. */
|
2018-08-03 21:00:38 -07:00
|
|
|
rdp->tick_nohz_enabled_snap = tne;
|
2012-06-24 10:15:02 -07:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (!tne)
|
|
|
|
return;
|
2012-03-15 12:16:26 -07:00
|
|
|
|
2011-11-22 17:07:11 -08:00
|
|
|
/*
|
2012-12-28 11:30:36 -08:00
|
|
|
* If we have not yet accelerated this jiffy, accelerate all
|
|
|
|
* callbacks on this CPU.
|
2011-11-22 17:07:11 -08:00
|
|
|
*/
|
2018-08-03 21:00:38 -07:00
|
|
|
if (rdp->last_accelerate == jiffies)
|
2011-11-02 06:54:54 -07:00
|
|
|
return;
|
2018-08-03 21:00:38 -07:00
|
|
|
rdp->last_accelerate = jiffies;
|
2018-07-04 15:35:00 -07:00
|
|
|
if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
|
2012-12-28 11:30:36 -08:00
|
|
|
rnp = rdp->mynode;
|
2015-10-08 12:24:23 +02:00
|
|
|
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
|
2018-07-03 17:22:34 -07:00
|
|
|
needwake = rcu_accelerate_cbs(rnp, rdp);
|
2015-12-29 12:18:47 +08:00
|
|
|
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
|
2014-03-11 13:02:16 -07:00
|
|
|
if (needwake)
|
2018-07-03 17:22:34 -07:00
|
|
|
rcu_gp_kthread_wake();
|
2010-04-25 21:04:29 -07:00
|
|
|
}
|
2012-12-28 11:30:36 -08:00
|
|
|
}
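The tail of rcu_prepare_for_idle() follows the pattern described in the "Make callers awaken grace-period kthread" commit message: the locked helper only reports that a wakeup is needed, and the caller performs the wakeup after dropping the lock. The sketch below models that hand-off in userspace. The lock is simulated with a flag so that the ordering can be checked with assertions, and every name in it (struct node, accelerate_locked(), wake_worker(), prepare_for_idle_sketch()) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a node protected by a lock. */
struct node {
	bool locked;		/* Simulated lock state. */
	int pending;		/* Work items waiting for the worker. */
	bool worker_requested;	/* Worker has already been asked to run. */
	int wakeups;		/* Number of wakeups issued. */
};

static void node_lock(struct node *n)   { assert(!n->locked); n->locked = true; }
static void node_unlock(struct node *n) { assert(n->locked); n->locked = false; }

/* Caller must hold the lock. Returns true if the caller must wake the worker. */
static bool accelerate_locked(struct node *n)
{
	assert(n->locked);
	if (n->pending == 0 || n->worker_requested)
		return false;
	n->worker_requested = true;
	return true;
}

/* Must never run with the lock held, mirroring the deadlock concern. */
static void wake_worker(struct node *n)
{
	assert(!n->locked);
	n->wakeups++;
}

static void prepare_for_idle_sketch(struct node *n)
{
	bool needwake;

	node_lock(n);
	needwake = accelerate_locked(n);
	node_unlock(n);
	if (needwake)
		wake_worker(n);	/* Lock already released. */
}
```

Returning a boolean keeps the wakeup decision under the lock while moving the wakeup itself outside it, which is the property the commit message identifies as necessary to avoid deadlocks with the scheduler.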
|
2011-11-22 17:07:11 -08:00
|
|
|
|
2012-12-28 11:30:36 -08:00
|
|
|
/*
|
|
|
|
* Clean up for exit from idle. Attempt to advance callbacks based on
|
|
|
|
* any grace periods that elapsed while the CPU was idle, and if any
|
|
|
|
* callbacks are now ready to invoke, initiate invocation.
|
|
|
|
*/
|
2014-10-22 15:07:37 -07:00
|
|
|
static void rcu_cleanup_after_idle(void)
|
2012-12-28 11:30:36 -08:00
|
|
|
{
|
2019-04-12 15:58:34 -07:00
|
|
|
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
|
|
|
|
2017-11-06 16:01:30 +01:00
|
|
|
lockdep_assert_irqs_disabled();
|
2020-11-12 01:51:21 +01:00
|
|
|
if (rcu_rdp_is_offloaded(rdp))
|
2011-11-02 06:54:54 -07:00
|
|
|
return;
|
2013-08-22 18:16:16 -07:00
|
|
|
if (rcu_try_advance_all_cbs())
|
|
|
|
invoke_rcu_core();
|
2010-02-22 17:04:59 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
|
2012-01-16 13:29:10 -08:00
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
#ifdef CONFIG_RCU_NOCB_CPU
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Offload callback processing from the boot-time-specified set of CPUs
|
2018-12-03 14:07:17 -08:00
|
|
|
* specified by rcu_nocb_mask. For the CPUs in the set, there are kthreads
|
|
|
|
* created that pull the callbacks from the corresponding CPU, wait for
|
|
|
|
* a grace period to elapse, and invoke the callbacks. These kthreads
|
2019-03-28 15:44:18 -07:00
|
|
|
* are organized into GP kthreads, which manage incoming callbacks, wait for
|
|
|
|
* grace periods, and awaken CB kthreads, and the CB kthreads, which only
|
|
|
|
* invoke callbacks. Each GP kthread invokes its own CBs. The no-CBs CPUs
|
|
|
|
* do a wake_up() on their GP kthread when they insert a callback into any
|
2018-12-03 14:07:17 -08:00
|
|
|
* empty list, unless the rcu_nocb_poll boot parameter has been specified,
|
|
|
|
* in which case each kthread actively polls its CPU. (Which isn't so great
|
|
|
|
* for energy efficiency, but which does reduce RCU's overhead on that CPU.)
|
2012-08-19 21:35:53 -07:00
|
|
|
*
|
|
|
|
* This is intended to be used in conjunction with Frederic Weisbecker's
|
|
|
|
* adaptive-idle work, which would seriously reduce OS jitter on CPUs
|
|
|
|
* running CPU-bound user-mode computations.
|
|
|
|
*
|
2018-12-03 14:07:17 -08:00
|
|
|
* Offloading of callbacks can also be used as an energy-efficiency
|
|
|
|
* measure because CPUs with no RCU callbacks queued are more aggressive
|
|
|
|
* about entering dyntick-idle mode.
|
2012-08-19 21:35:53 -07:00
|
|
|
*/
|
|
|
|
|
|
|
|
|
2019-03-06 14:47:56 -08:00
|
|
|
/*
 * Parse the boot-time rcu_nocb_mask CPU list from the kernel parameters.
 * The string after the "rcu_nocbs=" is either "all" for all CPUs, or a
 * comma-separated list of CPUs and/or CPU ranges. If an invalid list is
 * given, a warning is emitted and all CPUs are offloaded.
 */
static int __init rcu_nocb_setup(char *str)
{
	alloc_bootmem_cpumask_var(&rcu_nocb_mask);
rcu: Allow rcu_nocbs= to specify all CPUs
Currently, the rcu_nocbs= kernel boot parameter requires that a specific
list of CPUs be specified, and has no way to say "all of them".
As noted by user RavFX in a comment to Phoronix topic 1002538, this
is an inconvenient side effect of the removal of the RCU_NOCB_CPU_ALL
Kconfig option. This commit therefore enables the rcu_nocbs= kernel boot
parameter to be given the string "all", as in "rcu_nocbs=all" to specify
that all CPUs on the system are to have their RCU callbacks offloaded.
Another approach would be to make cpulist_parse() check for "all", but
there are uses of cpulist_parse() that do other checking, which could
conflict with an "all". This commit therefore focuses on the specific
use of cpulist_parse() in rcu_nocb_setup().
Just a note to other people who would like changes to Linux-kernel RCU:
If you send your requests to me directly, they might get fixed somewhat
faster. RavFX's comment was posted on January 22, 2018 and I first saw
it on March 5, 2019. And the only reason that I found it -at- -all- was
that I was looking for projects using RCU, and my search engine showed
me that Phoronix comment quite by accident. Your choice, though! ;-)
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-05 15:28:19 -08:00
	if (!strcasecmp(str, "all"))
		cpumask_setall(rcu_nocb_mask);
	else
		if (cpulist_parse(str, rcu_nocb_mask)) {
			pr_warn("rcu_nocbs= bad CPU range, all CPUs set\n");
			cpumask_setall(rcu_nocb_mask);
		}
	return 1;
}
__setup("rcu_nocbs=", rcu_nocb_setup);

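The rcu_nocbs= policy above (accept "all", otherwise parse a CPU list, and fall back to all CPUs when the list is malformed) can be modeled in user space. This is a minimal sketch under stated assumptions, not the kernel's code: NR_CPUS_DEMO, parse_cpulist_demo() and nocb_setup_demo() are hypothetical names, a 64-bit word stands in for the cpumask, and the list parser accepts only plain numbers and "lo-hi" ranges, which is a subset of what the kernel's cpulist_parse() understands.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>	/* strcasecmp() */

#define NR_CPUS_DEMO 64	/* hypothetical: one bit per CPU in a uint64_t */

/* Parse "0,2-5,7" into a bitmask. Returns 0 on success, -1 on bad input. */
static int parse_cpulist_demo(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	if (!*s)
		return -1;
	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= NR_CPUS_DEMO)
			return -1;
		for (unsigned long c = lo; c <= hi; c++)
			m |= UINT64_C(1) << c;
		if (*end == ',') {
			end++;
			if (!*end)
				return -1;	/* trailing comma */
		} else if (*end) {
			return -1;		/* unexpected character */
		}
		s = end;
	}
	*mask = m;
	return 0;
}

/* Same policy as above: "all" or a bad list selects every CPU. */
static uint64_t nocb_setup_demo(const char *str)
{
	uint64_t mask;

	if (!strcasecmp(str, "all"))
		return UINT64_MAX;
	if (parse_cpulist_demo(str, &mask)) {
		fprintf(stderr, "rcu_nocbs= bad CPU range, all CPUs set\n");
		return UINT64_MAX;
	}
	return mask;
}
```

For example, nocb_setup_demo("1-3") selects CPUs 1 through 3, while a reversed range such as "3-1" triggers the warning and selects every CPU, matching the fallback in rcu_nocb_setup().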
rcu: Make rcu_nocb_poll an early_param instead of module_param
The as-documented rcu_nocb_poll will fail to enable this feature
for two reasons. (1) there is an extra "s" in the documented
name which is not in the code, and (2) since it uses module_param,
it really is expecting a prefix, akin to "rcutree.fanout_leaf"
and the prefix isn't documented.
However, there are several reasons why we might not want to
simply fix the typo and add the prefix:
1) we'd end up with rcutree.rcu_nocb_poll, and rather probably make
a change to rcutree.nocb_poll
2) if we did #1, then the prefix wouldn't be consistent with the
rcu_nocbs=<cpumap> parameter (i.e. one with, one without prefix)
3) the use of module_param in a header file is less than desired,
since it isn't immediately obvious that it will get processed
via rcutree.c and get the prefix from that (although use of
module_param_named() could clarify that.)
4) the implied export of /sys/module/rcutree/parameters/rcu_nocb_poll
data to userspace via module_param() doesn't really buy us anything,
as it is read-only and we can tell if it is enabled already without
it, since there is a printk at early boot telling us so.
In light of all that, just change it from a module_param() to an
early_setup() call, and worry about adding it to /sys later on if
we decide to allow a dynamic setting of it.
Also change the variable to be tagged as read_mostly, since it
will only ever be fiddled with at most, once at boot.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-12-20 13:19:22 -08:00
static int __init parse_rcu_nocb_poll(char *arg)
{
	rcu_nocb_poll = true;
	return 0;
}
early_param("rcu_nocb_poll", parse_rcu_nocb_poll);

/*
rcu/nocb: Add bypass callback queueing
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.
This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.
The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.
[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-07-02 16:03:33 -07:00
 * Don't bother bypassing ->cblist if the call_rcu() rate is low.
 * After all, the main point of bypassing is to avoid lock contention
 * on ->nocb_lock, which only can happen at high call_rcu() rates.
 */
int nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ;
module_param(nocb_nobypass_lim_per_jiffy, int, 0);

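The rate limit controlled by nocb_nobypass_lim_per_jiffy can be illustrated with a small user-space model. This is only a sketch of the policy described in the commit message (a bounded number of direct ->cblist enqueues per jiffy, overflow onto the bypass list, and a flush on each new jiffy or when the bypass list grows past a threshold). It is not the kernel's implementation: struct bypass_demo, enqueue_demo(), flood_demo(), HZ_DEMO and QHIMARK_DEMO are hypothetical, and plain counters stand in for the real callback lists.

```c
#include <stdbool.h>

#define HZ_DEMO 1000			/* hypothetical tick rate */
#define QHIMARK_DEMO 10000		/* hypothetical flush threshold */

struct bypass_demo {
	unsigned long first_jiffy;	/* jiffy in which the current count began */
	int direct_count;		/* direct ->cblist enqueues in this jiffy */
	long cblist_len;		/* stands in for ->cblist */
	long bypass_len;		/* stands in for ->nocb_bypass */
};

/* Same formula as the default above, using the assumed tick rate. */
static const int lim_per_jiffy_demo = 16 * 1000 / HZ_DEMO;

/* Enqueue one callback at time "now"; returns true if it was bypassed. */
static bool enqueue_demo(struct bypass_demo *b, unsigned long now)
{
	if (now != b->first_jiffy) {
		/* New jiffy: restart the rate count and flush the bypass list. */
		b->first_jiffy = now;
		b->direct_count = 0;
		b->cblist_len += b->bypass_len;
		b->bypass_len = 0;
	}
	if (b->direct_count < lim_per_jiffy_demo) {
		/* Low rate: enqueue directly, as if holding ->nocb_lock. */
		b->direct_count++;
		b->cblist_len++;
		return false;
	}
	/* High rate: use the bypass list, flushing it if it grows too long. */
	b->bypass_len++;
	if (b->bypass_len > QHIMARK_DEMO) {
		b->cblist_len += b->bypass_len;
		b->bypass_len = 0;
	}
	return true;
}

/* Enqueue n callbacks within one jiffy; returns how many were bypassed. */
static int flood_demo(struct bypass_demo *b, unsigned long now, int n)
{
	int bypassed = 0;

	for (int i = 0; i < n; i++)
		bypassed += enqueue_demo(b, now);
	return bypassed;
}
```

With the assumed tick rate of 1000, the limit is 16 direct enqueues per jiffy, so a burst of 20 enqueues in one jiffy sends the last 4 to the bypass list, and the next jiffy's first enqueue flushes them back.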
/*
 * Acquire the specified rcu_data structure's ->nocb_bypass_lock. If the
 * lock isn't immediately available, increment ->nocb_lock_contended to
 * flag the contention.
 */
static void rcu_nocb_bypass_lock(struct rcu_data *rdp)
	__acquires(&rdp->nocb_bypass_lock)
{
	lockdep_assert_irqs_disabled();
	if (raw_spin_trylock(&rdp->nocb_bypass_lock))
		return;
	atomic_inc(&rdp->nocb_lock_contended);
	WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
	smp_mb__after_atomic(); /* atomic_inc() before lock. */
	raw_spin_lock(&rdp->nocb_bypass_lock);
	smp_mb__before_atomic(); /* atomic_dec() after lock. */
	atomic_dec(&rdp->nocb_lock_contended);
}

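The trylock-then-flag pattern used by rcu_nocb_bypass_lock() can be sketched in user space with C11 atomics. This is an illustration only, assuming a simple test-and-set spinlock in place of the kernel's raw spinlock; struct lock_demo and the *_demo functions are hypothetical, and the default sequentially consistent atomics stand in for atomic_inc()/atomic_dec() combined with the explicit memory barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct lock_demo {
	atomic_flag locked;	/* stands in for the raw spinlock */
	atomic_int contended;	/* stands in for ->nocb_lock_contended */
};

/* Returns true if the lock was acquired without waiting. */
static bool trylock_demo(struct lock_demo *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

static void lock_demo_acquire(struct lock_demo *l)
{
	if (trylock_demo(l))
		return;				/* Fast path: no contention. */
	atomic_fetch_add(&l->contended, 1);	/* Advertise the contention. */
	while (!trylock_demo(l))
		;				/* Spin until the holder releases. */
	atomic_fetch_sub(&l->contended, 1);	/* Contention is over. */
}

static void lock_demo_release(struct lock_demo *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Counterpart of rcu_nocb_wait_contended(): wait for waiters to drain. */
static void wait_contended_demo(struct lock_demo *l)
{
	while (atomic_load(&l->contended))
		;
}
```

The counter is touched only on the slow path, so an uncontended acquisition costs a single atomic operation, while a party that wants to back off can poll the counter without touching the lock itself.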
/*
 * Spinwait until the specified rcu_data structure's ->nocb_lock is
 * not contended. Please note that this is extremely special-purpose,
 * relying on the fact that at most two kthreads and one CPU contend for
 * this lock, and also that the two kthreads are guaranteed to have frequent
 * grace-period-duration time intervals between successive acquisitions
 * of the lock. This allows us to use an extremely simple throttling
 * mechanism, and further to apply it only to the CPU doing floods of
 * call_rcu() invocations. Don't try this at home!
 */
static void rcu_nocb_wait_contended(struct rcu_data *rdp)
{
	WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
	while (WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended)))
		cpu_relax();
}

/*
 * Conditionally acquire the specified rcu_data structure's
 * ->nocb_bypass_lock.
 */
static bool rcu_nocb_bypass_trylock(struct rcu_data *rdp)
{
	lockdep_assert_irqs_disabled();
	return raw_spin_trylock(&rdp->nocb_bypass_lock);
}

/*
 * Release the specified rcu_data structure's ->nocb_bypass_lock.
 */
static void rcu_nocb_bypass_unlock(struct rcu_data *rdp)
	__releases(&rdp->nocb_bypass_lock)
{
	lockdep_assert_irqs_disabled();
	raw_spin_unlock(&rdp->nocb_bypass_lock);
}

/*
 * Acquire the specified rcu_data structure's ->nocb_lock, but only
 * if it corresponds to a no-CBs CPU.
 */
static void rcu_nocb_lock(struct rcu_data *rdp)
{
	lockdep_assert_irqs_disabled();
	if (!rcu_rdp_is_offloaded(rdp))
		return;
	raw_spin_lock(&rdp->nocb_lock);
}

/*
 * Release the specified rcu_data structure's ->nocb_lock, but only
 * if it corresponds to a no-CBs CPU.
 */
static void rcu_nocb_unlock(struct rcu_data *rdp)
{
	if (rcu_rdp_is_offloaded(rdp)) {
		lockdep_assert_irqs_disabled();
		raw_spin_unlock(&rdp->nocb_lock);
	}
}

/*
 * Release the specified rcu_data structure's ->nocb_lock and restore
 * interrupts, but only if it corresponds to a no-CBs CPU.
 */
static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
				       unsigned long flags)
{
	if (rcu_rdp_is_offloaded(rdp)) {
		lockdep_assert_irqs_disabled();
		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
	} else {
		local_irq_restore(flags);
	}
}

/* Lockdep check that ->cblist may be safely accessed. */
static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
{
	lockdep_assert_irqs_disabled();
	if (rcu_rdp_is_offloaded(rdp))
		lockdep_assert_held(&rdp->nocb_lock);
}

/*
 * Wake up any no-CBs CPUs' kthreads that were waiting on the just-ended
 * grace period.
 */
static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq)
{
	swake_up_all(sq);
}

static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp)
{
	return &rnp->nocb_gp_wq[rcu_seq_ctr(rnp->gp_seq) & 0x1];
}

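rcu_nocb_gp_get() picks one of the two ->nocb_gp_wq[] wait queues from the low-order bit of the grace-period counter, so consecutive grace periods alternate between the queues. The sketch below shows only that index arithmetic. It assumes that rcu_seq_ctr() discards two low-order state bits by shifting right, which is not shown in this section, and all *_demo names are hypothetical.

```c
#define RCU_SEQ_CTR_SHIFT_DEMO 2	/* assumption: two low-order state bits */

/* Extract the grace-period counter from a sequence number. */
static unsigned long rcu_seq_ctr_demo(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT_DEMO;
}

/* Index of the wait queue used for the grace period numbered by gp_seq. */
static int nocb_gp_wq_index_demo(unsigned long gp_seq)
{
	return rcu_seq_ctr_demo(gp_seq) & 0x1;
}
```

Under that assumption, sequence values 0 through 3 map to queue 0, values 4 through 7 map to queue 1, and so on. The alternation presumably keeps waiters for the next grace period on a different queue from those awakened by rcu_nocb_gp_cleanup() for the one that just ended.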
static void rcu_init_one_nocb(struct rcu_node *rnp)
{
	init_swait_queue_head(&rnp->nocb_gp_wq[0]);
	init_swait_queue_head(&rnp->nocb_gp_wq[1]);
}

/* Is the specified CPU a no-CBs CPU? */
bool rcu_is_nocb_cpu(int cpu)
{
	if (cpumask_available(rcu_nocb_mask))
		return cpumask_test_cpu(cpu, rcu_nocb_mask);
	return false;
}

rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things. This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.
To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers. By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders. In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.
For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period. This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-24 09:26:11 -07:00
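The square-root grouping described in the commit message above can be worked through with a short calculation. This is a sketch of the arithmetic only: isqrt_demo(), struct nocb_groups_demo and nocb_group_layout_demo() are hypothetical names, and rounding the square root down and the group count up is an assumption, since the exact rounding is not shown here.

```c
/* Integer square root: largest r with r * r <= n. */
static unsigned int isqrt_demo(unsigned int n)
{
	unsigned int r = 0;

	while ((unsigned long long)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

struct nocb_groups_demo {
	unsigned int stride;	/* CPUs per group, leader included */
	unsigned int ngroups;	/* leaders the grace-period kthread awakens */
};

/* Split ncpus into groups of roughly sqrt(ncpus) CPUs each. */
static struct nocb_groups_demo nocb_group_layout_demo(unsigned int ncpus)
{
	struct nocb_groups_demo g;

	g.stride = isqrt_demo(ncpus);
	if (g.stride == 0)
		g.stride = 1;
	g.ngroups = (ncpus + g.stride - 1) / g.stride;	/* round up */
	return g;
}
```

For 4096 CPUs this yields 64 groups of 64 CPUs, that is, 64 leaders with 63 followers each, which agrees with the numerical example in the commit message. For 80 CPUs it yields 10 groups of 8.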
/*
 * Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock
 * and this function releases it.
 */
static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
			 unsigned long flags)
	__releases(rdp->nocb_lock)
rcu: Parallelize and economize NOCB kthread wakeups
2014-06-24 09:26:11 -07:00
|
|
|
{
|
rcu/nocb: Add bypass callback queueing
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.
This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.
The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.
[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-07-02 16:03:33 -07:00
|
|
|
bool needwake = false;
|
2019-03-31 16:11:57 -07:00
|
|
|
struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
|
rcu: Parallelize and economize NOCB kthread wakeups
2014-06-24 09:26:11 -07:00
|
|
|
|
2017-04-29 20:03:20 -07:00
|
|
|
lockdep_assert_held(&rdp->nocb_lock);
|
2019-03-31 16:11:57 -07:00
|
|
|
if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
|
|
|
TPS("AlreadyAwake"));
|
2019-05-28 07:18:08 -07:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2021-02-01 00:05:46 +01:00
|
|
|
return false;
|
2017-04-29 20:03:20 -07:00
|
|
|
}
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
del_timer(&rdp->nocb_timer);
|
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
|
|
|
raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
|
|
|
|
if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
|
2019-05-15 09:56:40 -07:00
|
|
|
WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
needwake = true;
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DoWake"));
|
rcu: Parallelize and economize NOCB kthread wakeups
2014-06-24 09:26:11 -07:00
|
|
|
}
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
|
|
|
|
if (needwake)
|
|
|
|
wake_up_process(rdp_gp->nocb_gp_kthread);
|
2021-02-01 00:05:46 +01:00
|
|
|
|
|
|
|
return needwake;
|
rcu: Parallelize and economize NOCB kthread wakeups
2014-06-24 09:26:11 -07:00
|
|
|
}
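
The leader/follower arithmetic quoted in the "Parallelize and economize NOCB kthread wakeups" commit message above can be checked with a small user-space sketch. This is not kernel code. The names `isqrt_floor`, `nocb_count_wakeups`, and `struct nocb_wakeup_counts` are hypothetical, the number of groups is assumed to be the floor of the square root of the CPU count, and the ungrouped figure simply follows the message's own "79 wakeups on an 80-CPU system", that is, one fewer than the number of CPUs.

```c
/* Illustrative user-space model only; none of these names exist in the kernel. */
struct nocb_wakeup_counts {
	unsigned int leaders;              /* wakeups done by the GP kthread */
	unsigned int followers_per_leader; /* wakeups done by each leader */
	unsigned int flat_wakeups;         /* GP-kthread wakeups without grouping */
};

/* Largest r such that r * r <= n, found by linear search. */
static unsigned int isqrt_floor(unsigned int n)
{
	unsigned int r = 0;

	while ((unsigned long long)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Assumption: one group per leader, with sqrt(ncpus) groups. */
static struct nocb_wakeup_counts nocb_count_wakeups(unsigned int ncpus)
{
	struct nocb_wakeup_counts c = { 0, 0, 0 };
	unsigned int group_size;

	if (ncpus == 0)
		return c;
	c.leaders = isqrt_floor(ncpus);
	group_size = (ncpus + c.leaders - 1) / c.leaders; /* round up */
	c.followers_per_leader = group_size - 1;          /* leader excluded */
	c.flat_wakeups = ncpus - 1;                       /* per the commit message */
	return c;
}
```

Calling `nocb_count_wakeups(4096)` gives 64 leaders with 63 followers each, matching the numerical example in the message, while `nocb_count_wakeups(80).flat_wakeups` gives the 79 wakeups that the message uses for comparison.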
|
|
|
|
|
2017-04-29 20:03:20 -07:00
|
|
|
/*
|
2019-03-28 15:44:18 -07:00
|
|
|
* Arrange to wake the GP kthread for this NOCB group at some future
|
|
|
|
* time when it is safe to do so.
|
2017-04-29 20:03:20 -07:00
|
|
|
*/
|
2019-03-31 16:19:02 -07:00
|
|
|
static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
|
|
|
|
const char *reason)
|
2017-04-29 20:03:20 -07:00
|
|
|
{
|
2020-11-13 13:13:23 +01:00
|
|
|
if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_OFF)
|
|
|
|
return;
|
2017-04-29 20:03:20 -07:00
|
|
|
if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
|
|
|
|
mod_timer(&rdp->nocb_timer, jiffies + 1);
|
2019-05-23 13:49:26 -07:00
|
|
|
if (rdp->nocb_defer_wakeup < waketype)
|
|
|
|
WRITE_ONCE(rdp->nocb_defer_wakeup, waketype);
|
2018-07-04 14:45:00 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
|
rcu: Make rcu_barrier() understand about missing rcuo kthreads
Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online. This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a44a, this could result in huge numbers of useless
rcuo kthreads being created.
It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix. The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.
It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread. This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks. It is therefore required to wait even for those callbacks
that cannot possibly be invoked. Even if doing so hangs the system.
Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), it is tempting to report an error
in this case. Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().
So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.
Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>
2014-10-27 09:15:54 -07:00
|
|
|
}
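
The escalation logic in wake_nocb_gp_defer() can be modelled in user space as follows. This is a sketch under stated assumptions rather than the kernel's implementation: `struct defer_model`, `model_wake_defer`, and the `MODEL_WAKE_*` constants are hypothetical stand-ins, the numeric values of the constants are assumed (only their ordering matters to the comparison), and the timer is reduced to an "armed" flag plus an expiry time.

```c
#include <stdbool.h>

/* Assumed values; only the ordering OFF < NOT < WAKE < FORCE matters here. */
enum model_wake {
	MODEL_WAKE_OFF = -1,	/* deferred wakeups disabled */
	MODEL_WAKE_NOT = 0,	/* no wakeup pending */
	MODEL_WAKE = 1,		/* ordinary wakeup pending */
	MODEL_WAKE_FORCE = 2,	/* forced wakeup pending */
};

/* Hypothetical stand-in for the fields used by wake_nocb_gp_defer(). */
struct defer_model {
	int level;			/* models ->nocb_defer_wakeup */
	bool timer_armed;		/* models ->nocb_timer being queued */
	unsigned long timer_expires;	/* in model "jiffies" */
};

static void model_wake_defer(struct defer_model *m, int waketype,
			     unsigned long now)
{
	if (m->level == MODEL_WAKE_OFF)
		return;			/* deferral disabled: do nothing */
	if (m->level == MODEL_WAKE_NOT) {
		/* First request since the last wakeup: arm the timer. */
		m->timer_armed = true;
		m->timer_expires = now + 1;
	}
	if (m->level < waketype)
		m->level = waketype;	/* the level only ever increases */
}
```

In this model, a second request that arrives while a wakeup is already pending can raise the level but leaves the timer's expiry untouched, so the first request determines when the deferred wakeup fires.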
|
|
|
|
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
/*
|
|
|
|
* Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL.
|
|
|
|
* However, if there is a callback to be enqueued and if ->nocb_bypass
|
|
|
|
* proves to be initially empty, just return false because the no-CB GP
|
|
|
|
* kthread may need to be awakened in this case.
|
|
|
|
*
|
|
|
|
* Note that this function always returns true if rhp is NULL.
|
|
|
|
*/
|
|
|
|
static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
|
|
|
|
unsigned long j)
|
|
|
|
{
|
|
|
|
struct rcu_cblist rcl;
|
|
|
|
|
2020-11-12 01:51:21 +01:00
|
|
|
WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
rcu_lockdep_assert_cblist_protected(rdp);
|
|
|
|
lockdep_assert_held(&rdp->nocb_bypass_lock);
|
|
|
|
if (rhp && !rcu_cblist_n_cbs(&rdp->nocb_bypass)) {
|
|
|
|
raw_spin_unlock(&rdp->nocb_bypass_lock);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
|
|
|
|
if (rhp)
|
|
|
|
rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
|
|
|
|
rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
|
|
|
|
rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
|
|
|
|
WRITE_ONCE(rdp->nocb_bypass_first, j);
|
|
|
|
rcu_nocb_bypass_unlock(rdp);
|
|
|
|
return true;
|
|
|
|
}
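
The flush performed by rcu_nocb_do_flush_bypass() can be illustrated with a simplified user-space model. Every name below (`struct model_cb`, `struct model_cblist`, `struct flush_model`, and the `model_*` functions) is hypothetical, locking is omitted, and the segmented ->cblist is reduced to a single list. The model also assumes that a non-NULL new callback is left as the sole element of the bypass list after the old contents have been moved. That is an assumption about rcu_cblist_flush_enqueue(), whose body is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_cb {
	struct model_cb *next;
	int id;
};

struct model_cblist {
	struct model_cb *head;
	struct model_cb **tail;	/* points at the terminating NULL pointer */
	long len;
};

struct flush_model {
	struct model_cblist main;	/* stands in for ->cblist */
	struct model_cblist bypass;	/* stands in for ->nocb_bypass */
	long total_len;			/* counts callbacks on both lists */
	unsigned long bypass_first;	/* time of the most recent flush */
};

static void model_cblist_init(struct model_cblist *l)
{
	l->head = NULL;
	l->tail = &l->head;
	l->len = 0;
}

static void flush_model_init(struct flush_model *m)
{
	model_cblist_init(&m->main);
	model_cblist_init(&m->bypass);
	m->total_len = 0;
	m->bypass_first = 0;
}

static void model_cblist_enqueue(struct model_cblist *l, struct model_cb *cb)
{
	cb->next = NULL;
	*l->tail = cb;
	l->tail = &cb->next;
	l->len++;
}

/* Queue a callback on the bypass list, counting it in total_len first. */
static void model_bypass_enqueue(struct flush_model *m, struct model_cb *cb)
{
	m->total_len++;
	model_cblist_enqueue(&m->bypass, cb);
}

/* Returns false, changing nothing, if newcb is non-NULL and bypass is empty. */
static bool model_flush_bypass(struct flush_model *m, struct model_cb *newcb,
			       unsigned long j)
{
	if (newcb && m->bypass.len == 0)
		return false;
	if (newcb)
		m->total_len++;		/* count newcb before enqueuing it */
	if (m->bypass.head) {		/* append old bypass contents to main */
		*m->main.tail = m->bypass.head;
		m->main.tail = m->bypass.tail;
		m->main.len += m->bypass.len;
	}
	model_cblist_init(&m->bypass);
	if (newcb)			/* assumed: newcb restarts the bypass list */
		model_cblist_enqueue(&m->bypass, newcb);
	m->bypass_first = j;
	return true;
}
```

As in the comment on the original function, a call with a NULL callback always returns true, and `total_len` never needs adjusting for the callbacks that merely move from one list to the other.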
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL.
|
|
|
|
* However, if there is a callback to be enqueued and if ->nocb_bypass
|
|
|
|
* proves to be initially empty, just return false because the no-CB GP
|
|
|
|
* kthread may need to be awakened in this case.
|
|
|
|
*
|
|
|
|
* Note that this function always returns true if rhp is NULL.
|
|
|
|
*/
|
|
|
|
static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
|
|
|
|
unsigned long j)
|
|
|
|
{
|
2020-11-12 01:51:21 +01:00
|
|
|
if (!rcu_rdp_is_offloaded(rdp))
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
return true;
|
|
|
|
rcu_lockdep_assert_cblist_protected(rdp);
|
|
|
|
rcu_nocb_bypass_lock(rdp);
|
|
|
|
return rcu_nocb_do_flush_bypass(rdp, rhp, j);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the ->nocb_bypass_lock is immediately available, flush the
|
|
|
|
* ->nocb_bypass queue into ->cblist.
|
|
|
|
*/
|
|
|
|
static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
|
|
|
|
{
|
|
|
|
rcu_lockdep_assert_cblist_protected(rdp);
|
2020-11-12 01:51:21 +01:00
|
|
|
if (!rcu_rdp_is_offloaded(rdp) ||
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
!rcu_nocb_bypass_trylock(rdp))
|
|
|
|
return;
|
|
|
|
WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
|
|
|
|
}
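
The "flush only if the lock is immediately available" pattern used by rcu_nocb_try_flush_bypass() can be sketched in user space with a C11 `atomic_flag` standing in for ->nocb_bypass_lock. The names `struct try_model`, `try_model_init`, `model_trylock`, `model_unlock`, and `model_try_flush` are hypothetical, the two lists are reduced to counters, and the kernel's interrupt and lockdep handling is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the rcu_data fields involved. */
struct try_model {
	atomic_flag lock;	/* models ->nocb_bypass_lock */
	bool offloaded;		/* models rcu_rdp_is_offloaded() */
	long bypass_len;	/* callbacks waiting on the bypass list */
	long main_len;		/* callbacks already on the main list */
};

static void try_model_init(struct try_model *m, bool offloaded, long bypass_len)
{
	atomic_flag_clear(&m->lock);
	m->offloaded = offloaded;
	m->bypass_len = bypass_len;
	m->main_len = 0;
}

/* Returns true if the lock was free and is now held by the caller. */
static bool model_trylock(struct try_model *m)
{
	return !atomic_flag_test_and_set_explicit(&m->lock,
						  memory_order_acquire);
}

static void model_unlock(struct try_model *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

/* Flush if possible; return whether the flush actually happened. */
static bool model_try_flush(struct try_model *m)
{
	if (!m->offloaded || !model_trylock(m))
		return false;	/* not offloaded, or lock is contended */
	m->main_len += m->bypass_len;
	m->bypass_len = 0;
	model_unlock(m);
	return true;
}
```

Giving up on contention costs nothing here: whoever holds the lock is already working on the bypass list, and a later flush attempt can pick up whatever remains.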
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See whether it is appropriate to use the ->nocb_bypass list in order
|
|
|
|
* to control contention on ->nocb_lock. A limited number of direct
|
|
|
|
* enqueues are permitted into ->cblist per jiffy. If ->nocb_bypass
|
|
|
|
* is non-empty, further callbacks must be placed into ->nocb_bypass,
|
|
|
|
* otherwise rcu_barrier() breaks. Use rcu_nocb_flush_bypass() to switch
|
|
|
|
* back to direct use of ->cblist. However, ->nocb_bypass should not be
|
|
|
|
* used if ->cblist is empty, because otherwise callbacks can be stranded
|
|
|
|
* on ->nocb_bypass because we cannot count on the current CPU ever again
|
|
|
|
* invoking call_rcu(). The general rule is that if ->nocb_bypass is
|
|
|
|
* non-empty, the corresponding no-CBs grace-period kthread must not be
|
|
|
|
* in an indefinite sleep state.
|
|
|
|
*
|
|
|
|
* Finally, it is not permitted to use the bypass during early boot,
|
|
|
|
* as doing so would confuse the auto-initialization code. Besides
|
|
|
|
* which, there is no point in worrying about lock contention while
|
|
|
|
* there is only one CPU in operation.
|
|
|
|
*/
|
|
|
|
static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
|
|
|
|
bool *was_alldone, unsigned long flags)
|
|
|
|
{
|
|
|
|
unsigned long c;
|
|
|
|
unsigned long cur_gp_seq;
|
|
|
|
unsigned long j = jiffies;
|
|
|
|
long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
|
|
|
|
|
2020-11-12 01:51:21 +01:00
|
|
|
if (!rcu_rdp_is_offloaded(rdp)) {
|
rcu/nocb: Add bypass callback queueing
2019-07-02 16:03:33 -07:00
|
|
|
*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
|
|
|
|
return false; /* Not offloaded, no bypassing. */
|
|
|
|
}
|
|
|
|
lockdep_assert_irqs_disabled();
|
|
|
|
|
|
|
|
// Don't use ->nocb_bypass during early boot.
|
|
|
|
if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
|
|
|
|
rcu_nocb_lock(rdp);
|
|
|
|
WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
|
|
|
|
*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
// If we have advanced to a new jiffy, reset counts to allow
|
|
|
|
// moving back from ->nocb_bypass to ->cblist.
|
|
|
|
if (j == rdp->nocb_nobypass_last) {
|
|
|
|
c = rdp->nocb_nobypass_count + 1;
|
|
|
|
} else {
|
|
|
|
WRITE_ONCE(rdp->nocb_nobypass_last, j);
|
|
|
|
c = rdp->nocb_nobypass_count - nocb_nobypass_lim_per_jiffy;
|
|
|
|
if (ULONG_CMP_LT(rdp->nocb_nobypass_count,
|
|
|
|
nocb_nobypass_lim_per_jiffy))
|
|
|
|
c = 0;
|
|
|
|
else if (c > nocb_nobypass_lim_per_jiffy)
|
|
|
|
c = nocb_nobypass_lim_per_jiffy;
|
|
|
|
}
|
|
|
|
WRITE_ONCE(rdp->nocb_nobypass_count, c);
|
|
|
|
|
|
|
|
// If there hasn't yet been all that many ->cblist enqueues
|
|
|
|
// this jiffy, tell the caller to enqueue onto ->cblist. But flush
|
|
|
|
// ->nocb_bypass first.
|
|
|
|
if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
|
|
|
|
rcu_nocb_lock(rdp);
|
|
|
|
*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
|
|
|
|
if (*was_alldone)
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
|
|
|
TPS("FirstQ"));
|
|
|
|
WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
|
|
|
|
WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
|
|
|
|
return false; // Caller must enqueue the callback.
|
|
|
|
}
|
|
|
|
|
|
|
|
// If ->nocb_bypass has been used too long or is too full,
|
|
|
|
// flush ->nocb_bypass to ->cblist.
|
|
|
|
if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
|
|
|
|
ncbs >= qhimark) {
|
|
|
|
rcu_nocb_lock(rdp);
|
|
|
|
if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
|
|
|
|
*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
|
|
|
|
if (*was_alldone)
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
|
|
|
TPS("FirstQ"));
|
|
|
|
WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
|
|
|
|
return false; // Caller must enqueue the callback.
|
|
|
|
}
|
|
|
|
if (j != rdp->nocb_gp_adv_time &&
|
|
|
|
rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
|
|
|
|
rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
|
|
|
|
rcu_advance_cbs_nowake(rdp->mynode, rdp);
|
|
|
|
rdp->nocb_gp_adv_time = j;
|
|
|
|
}
|
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
|
|
|
return true; // Callback already enqueued.
|
|
|
|
}
|
|
|
|
|
|
|
|
// We need to use the bypass.
|
|
|
|
rcu_nocb_wait_contended(rdp);
|
|
|
|
rcu_nocb_bypass_lock(rdp);
|
|
|
|
ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
|
|
|
|
rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
|
|
|
|
rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
|
|
|
|
if (!ncbs) {
|
|
|
|
WRITE_ONCE(rdp->nocb_bypass_first, j);
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
|
|
|
|
}
|
|
|
|
rcu_nocb_bypass_unlock(rdp);
|
|
|
|
smp_mb(); /* Order enqueue before wake. */
|
|
|
|
if (ncbs) {
|
|
|
|
local_irq_restore(flags);
|
|
|
|
} else {
|
|
|
|
// No-CBs GP kthread might be indefinitely asleep, if so, wake.
|
|
|
|
rcu_nocb_lock(rdp); // Rare during call_rcu() flood.
|
|
|
|
if (!rcu_segcblist_pend_cbs(&rdp->cblist)) {
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
|
|
|
TPS("FirstBQwake"));
|
|
|
|
__call_rcu_nocb_wake(rdp, true, flags);
|
|
|
|
} else {
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
|
|
|
TPS("FirstBQnoWake"));
|
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return true; // Callback already enqueued.
|
|
|
|
}
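The per-jiffy accounting at the top of rcu_nocb_try_bypass() above can be tried out in user space. What follows is a minimal sketch, not kernel code: struct nobypass_model and nobypass_model_direct_ok() are hypothetical names, the limit is passed in as a parameter instead of being read from nocb_nobypass_lim_per_jiffy, and a plain comparison stands in for ULONG_CMP_LT() (so counter wraparound is not modeled). No locking, flushing, or wakeups are modeled either.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two rcu_data fields used by the count logic. */
struct nobypass_model {
	unsigned long last_jiffy;	/* models ->nocb_nobypass_last */
	unsigned long count;		/* models ->nocb_nobypass_count */
};

/*
 * Update the direct-enqueue count for an enqueue attempt at jiffy j.
 * Returns true if the caller would still enqueue directly onto ->cblist,
 * false if the limit has been reached and the bypass would be considered.
 */
bool nobypass_model_direct_ok(struct nobypass_model *m,
			      unsigned long j, unsigned long lim)
{
	unsigned long c;

	if (j == m->last_jiffy) {
		/* Same jiffy: one more direct-enqueue attempt. */
		c = m->count + 1;
	} else {
		/* New jiffy: decay the count by one jiffy's allowance. */
		m->last_jiffy = j;
		if (m->count < lim) {
			c = 0;
		} else {
			c = m->count - lim;
			if (c > lim)
				c = lim;	/* Clamp leftover debt to one jiffy's worth. */
		}
	}
	m->count = c;
	return c < lim;
}
```

With a limit of 2, the first two calls in a given jiffy return true and later calls in that jiffy return false. Moving to the next jiffy subtracts the limit from the count instead of zeroing it, so a CPU that overshot keeps part of its debt.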
|
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
/*
|
2019-05-15 09:56:40 -07:00
|
|
|
* Awaken the no-CBs grace-period kthread if needed, either due to it
|
|
|
|
* legitimately being asleep or due to overload conditions.
|
2012-08-19 21:35:53 -07:00
|
|
|
*
|
|
|
|
* If warranted, also wake up the kthread servicing this CPU's queues.
|
|
|
|
*/
|
2019-05-15 09:56:40 -07:00
|
|
|
static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
|
|
|
|
unsigned long flags)
|
|
|
|
__releases(rdp->nocb_lock)
|
2012-08-19 21:35:53 -07:00
|
|
|
{
|
2019-07-15 06:06:40 -07:00
|
|
|
unsigned long cur_gp_seq;
|
|
|
|
unsigned long j;
|
2019-05-23 13:56:12 -07:00
|
|
|
long len;
|
2012-08-19 21:35:53 -07:00
|
|
|
struct task_struct *t;
|
|
|
|
|
2019-05-15 09:56:40 -07:00
|
|
|
// If we are being polled or there is no kthread, just leave.
|
rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.
This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.
One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.
This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.
This change does introduce additional kthreads, however:
1. The number of additional kthreads is about the square root of
the number of CPUs, so that a 4096-CPU system would have only
about 64 additional kthreads. Note that recent changes
decreased the number of rcuo kthreads by a factor of two
(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
scripting to affinity these additional kthreads as needed, the
same as for the rcuop and rcuos kthreads. (There are no longer
any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
this would allow the rcuo kthreads to continue their dual
leader/follower roles, it complicates callback invocation
and makes it more difficult to consolidate rcuo callback
invocation with existing softirq callback invocation.
The introduction of rcuog kthreads should thus be acceptable.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-29 16:43:51 -07:00
|
|
|
t = READ_ONCE(rdp->nocb_gp_kthread);
|
2013-10-15 12:47:04 -07:00
|
|
|
if (rcu_nocb_poll || !t) {
|
2018-07-04 14:45:00 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
2013-08-14 16:24:26 -07:00
|
|
|
TPS("WakeNotPoll"));
|
2019-05-15 09:56:40 -07:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2012-08-19 21:35:53 -07:00
|
|
|
return;
|
2013-08-14 16:24:26 -07:00
|
|
|
}
|
2019-05-15 09:56:40 -07:00
|
|
|
// Need to actually do a wakeup.
|
|
|
|
len = rcu_segcblist_n_cbs(&rdp->cblist);
|
|
|
|
if (was_alldone) {
|
2019-05-23 10:43:58 -07:00
|
|
|
rdp->qlen_last_fqs_check = len;
|
rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&ctx->lock);
> lock(&rq->lock);
> lock(&ctx->lock);
> lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.
One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:
1. Determine when it is likely that a relevant scheduler lock is held.
2. Defer the wakeup in such cases.
3. Ensure that all deferred wakeups eventually happen, preferably
sooner rather than later.
We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.
The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.
This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-10-04 14:33:34 -07:00
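The three-step deferral plan in the commit message above can be sketched as a small user-space model. This is an illustration under assumptions, not the kernel's implementation: struct defer_model, model_call_rcu_wake(), and model_do_deferred_wakeup() are hypothetical names, the irqs_were_disabled argument stands in for the result of irqs_disabled_flags(flags), and a counter replaces the real wakeup.

```c
#include <stdbool.h>

enum wake_action { WAKE_NOW, WAKE_DEFERRED };

/* Hypothetical per-CPU state for the model. */
struct defer_model {
	bool defer_wakeup;	/* models ->nocb_defer_wakeup */
	int wakeups_done;	/* counts wakeups that actually happened */
};

/*
 * Steps 1 and 2: treat "interrupts were disabled" as a proxy for
 * "a scheduler lock might be held" and defer the wakeup in that case.
 */
enum wake_action model_call_rcu_wake(struct defer_model *m,
				     bool irqs_were_disabled)
{
	if (irqs_were_disabled) {
		m->defer_wakeup = true;
		return WAKE_DEFERRED;
	}
	m->wakeups_done++;
	return WAKE_NOW;
}

/*
 * Step 3: called from a later safe point (in the message, RCU core
 * processing or entry to an idle state) to perform any deferred wakeup.
 * Returns true if a deferred wakeup was carried out.
 */
bool model_do_deferred_wakeup(struct defer_model *m)
{
	if (!m->defer_wakeup)
		return false;
	m->defer_wakeup = false;
	m->wakeups_done++;
	return true;
}
```

The proxy is conservative: every call made with interrupts disabled is deferred, whether or not a scheduler lock is actually held, which matches the message's remark that deferring too often is at least safe.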
|
|
|
if (!irqs_disabled_flags(flags)) {
|
rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things. This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.
To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers. By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders. In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.
For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period. This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-24 09:26:11 -07:00
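The wakeup arithmetic in the commit message above can be checked with a few lines of C. This is a sketch only: model_isqrt(), model_leader_groups(), and model_flat_wakeups() are hypothetical helpers, the group count is taken to be the integer square root of the CPU count as the message describes for the default, and reading the message's "79 wakeups on an 80-CPU system" as ncpus - 1 is an assumption.

```c
/* Integer square root by linear search; adequate for illustration. */
unsigned long model_isqrt(unsigned long n)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

struct model_groups {
	unsigned long leaders;			/* woken by the grace-period kthread */
	unsigned long followers_per_leader;	/* woken by each leader */
};

/* Split ncpus kthreads into about sqrt(ncpus) groups, one leader each. */
struct model_groups model_leader_groups(unsigned long ncpus)
{
	struct model_groups g;

	g.leaders = model_isqrt(ncpus);
	g.followers_per_leader = g.leaders ? ncpus / g.leaders - 1 : 0;
	return g;
}

/* Wakeups done by the grace-period kthread without any grouping (assumed). */
unsigned long model_flat_wakeups(unsigned long ncpus)
{
	return ncpus ? ncpus - 1 : 0;
}
```

For 4096 CPUs this gives 64 leaders with 63 followers each, matching the figures in the message, while the ungrouped count for 80 CPUs is 79.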
|
|
|
/* ... if queue was empty ... */
|
2019-05-15 09:56:40 -07:00
|
|
|
wake_nocb_gp(rdp, false, flags);
|
2018-07-04 14:45:00 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
rcu: Break call_rcu() deadlock involving scheduler and perf
2013-10-04 14:33:34 -07:00
|
|
|
TPS("WakeEmpty"));
|
|
|
|
} else {
|
2019-03-31 16:19:02 -07:00
|
|
|
wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE,
|
|
|
|
TPS("WakeEmptyIsDeferred"));
|
2019-05-15 09:56:40 -07:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&ctx->lock);
> lock(&rq->lock);
> lock(&ctx->lock);
> lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1. Determine when it is likely that a relevant scheduler lock is held.
2. Defer the wakeup in such cases.
3. Ensure that all deferred wakeups eventually happen, preferably
   sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
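
The three-step deferral described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: the
struct and the model_*() helpers are hypothetical stand-ins, not kernel
APIs, and the irqs_disabled argument plays the role that the commit
message assigns to irqs_disabled_flags().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU rcu_data structure. */
struct rdp_model {
	bool nocb_defer_wakeup;	/* A wakeup was postponed and is still owed. */
	int wakeups_done;	/* Number of wakeups actually issued. */
};

/* Stand-in for waking the rcuo kthread, which may need scheduler locks. */
static void model_wake_kthread(struct rdp_model *rdp)
{
	rdp->wakeups_done++;
}

/*
 * Enqueue path: disabled interrupts are treated as a hint that a
 * scheduler lock might be held, so the wakeup is recorded, not done.
 */
static void model_call_rcu(struct rdp_model *rdp, bool irqs_disabled)
{
	if (irqs_disabled)
		rdp->nocb_defer_wakeup = true;
	else
		model_wake_kthread(rdp);
}

/*
 * Later safe point (modeling the scheduling-clock interrupt or entry
 * to an idle state): issue any wakeup that is still owed.
 * Returns true if a deferred wakeup was carried out.
 */
static bool model_do_deferred_wakeup(struct rdp_model *rdp)
{
	if (!rdp->nocb_defer_wakeup)
		return false;
	rdp->nocb_defer_wakeup = false;
	model_wake_kthread(rdp);
	return true;
}
```

In this model a deferral may happen more often than strictly needed,
which matches the "at least safe" trade-off stated in the commit message.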
2013-10-04 14:33:34 -07:00
|
|
|
}
|
2012-08-19 21:35:53 -07:00
|
|
|
} else if (len > rdp->qlen_last_fqs_check + qhimark) {
|
rcu: Parallelize and economize NOCB kthread wakeups

An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things. This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers. By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders. In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period. This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
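
The square-root grouping arithmetic in the numerical example above can
be checked with a short user-space sketch. The names isqrt(),
struct nocb_layout, and nocb_layout_for() are hypothetical, and the
rounding choices (integer square root for the stride, round-up division
for the group count) are assumptions made for illustration rather than
a statement of what the kernel does.

```c
/* Integer square root by linear search; adequate for CPU counts. */
static unsigned int isqrt(unsigned int n)
{
	unsigned int r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Hypothetical summary of how the NOCB kthreads are grouped. */
struct nocb_layout {
	unsigned int stride;	/* CPUs per group, leader included. */
	unsigned int leaders;	/* Number of groups, one leader each. */
};

/* Default layout: stride is about sqrt(ncpus), groups round up. */
static struct nocb_layout nocb_layout_for(unsigned int ncpus)
{
	struct nocb_layout l;

	l.stride = isqrt(ncpus);
	if (l.stride == 0)
		l.stride = 1;	/* Avoid dividing by zero for ncpus == 0. */
	l.leaders = (ncpus + l.stride - 1) / l.stride;
	return l;
}
```

For 4096 CPUs this gives a stride of 64 and 64 leaders, so each leader
has 63 followers, in line with the figures in the commit message.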
2014-06-24 09:26:11 -07:00
|
|
|
/* ... or if many callbacks queued. */
|
2019-05-23 10:43:58 -07:00
|
|
|
rdp->qlen_last_fqs_check = len;
|
2019-07-15 06:06:40 -07:00
|
|
|
j = jiffies;
|
|
|
|
if (j != rdp->nocb_gp_adv_time &&
|
|
|
|
rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
|
|
|
|
rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) {
|
2019-06-26 09:50:38 -07:00
|
|
|
rcu_advance_cbs_nowake(rdp->mynode, rdp);
|
2019-07-15 06:06:40 -07:00
|
|
|
rdp->nocb_gp_adv_time = j;
|
|
|
|
}
|
2019-07-16 02:17:00 -07:00
|
|
|
smp_mb(); /* Enqueue before timer_pending(). */
|
|
|
|
if ((rdp->nocb_cb_sleep ||
|
|
|
|
!rcu_segcblist_ready_cbs(&rdp->cblist)) &&
|
|
|
|
!timer_pending(&rdp->nocb_bypass_timer))
|
2019-07-09 06:54:42 -07:00
|
|
|
wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
|
|
|
|
TPS("WakeOvfIsDeferred"));
|
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2013-08-14 16:24:26 -07:00
|
|
|
} else {
|
2018-07-04 14:45:00 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
|
2019-05-15 09:56:40 -07:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2012-08-19 21:35:53 -07:00
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
rcu/nocb: Add bypass callback queueing

Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.

This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.

[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
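
The per-jiffy limit and the enqueue-side flushing described above can
be modeled in user space roughly as follows. This is a simplified
sketch, not the kernel implementation: the struct, the model_*()
helpers, and the MODEL_HZ and MODEL_QHIMARK values are assumptions
chosen for illustration, and locking is omitted entirely.

```c
#include <stdbool.h>

#define MODEL_HZ	1000	/* Assumed tick rate. */
#define MODEL_QHIMARK	10000	/* Assumed qhimark value. */

/* Direct enqueues allowed per jiffy: 16 * 1000 / HZ. */
static const int model_lim_per_jiffy = 16 * 1000 / MODEL_HZ;

/* Hypothetical per-CPU state; only list lengths are tracked. */
struct bypass_model {
	unsigned long nobypass_last;	/* Jiffy the direct count refers to. */
	int nobypass_count;		/* Direct enqueues in that jiffy. */
	unsigned long bypass_first;	/* Jiffy of first bypass enqueue. */
	long cblist_len;		/* Callbacks on the main list. */
	long bypass_len;		/* Callbacks on the bypass list. */
};

/* Move everything from the bypass list onto the main list. */
static void model_flush_bypass(struct bypass_model *m)
{
	m->cblist_len += m->bypass_len;
	m->bypass_len = 0;
}

/* Enqueue one callback at jiffy j; returns true if it was bypassed. */
static bool model_enqueue(struct bypass_model *m, unsigned long j)
{
	if (j != m->nobypass_last) {	/* New jiffy: reset the budget. */
		m->nobypass_last = j;
		m->nobypass_count = 0;
	}
	/* Flush a bypass list that is a jiffy old or has grown too long. */
	if (m->bypass_len &&
	    (j != m->bypass_first || m->bypass_len > MODEL_QHIMARK))
		model_flush_bypass(m);
	if (m->nobypass_count < model_lim_per_jiffy) {
		m->nobypass_count++;
		m->cblist_len++;	/* Within budget: enqueue directly. */
		return false;
	}
	if (m->bypass_len == 0)
		m->bypass_first = j;
	m->bypass_len++;		/* Over budget: use the bypass list. */
	return true;
}
```

With these assumed values, 20 enqueues within one jiffy place 16
callbacks on the main list and 4 on the bypass list, and the first
enqueue in the next jiffy flushes those 4 before enqueueing directly.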
2019-07-02 16:03:33 -07:00
|
|
|
/* Wake up the no-CBs GP kthread to flush ->nocb_bypass. */
|
|
|
|
static void do_nocb_bypass_wakeup_timer(struct timer_list *t)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer);
|
|
|
|
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
|
|
|
|
rcu_nocb_lock_irqsave(rdp, flags);
|
2019-07-16 02:17:00 -07:00
|
|
|
smp_mb__after_spinlock(); /* Timer expire before wakeup. */
|
2019-07-02 16:03:33 -07:00
|
|
|
__call_rcu_nocb_wake(rdp, true, flags);
|
|
|
|
}
|
|
|
|
|
2020-11-13 13:13:22 +01:00
|
|
|
/*
|
|
|
|
* Check if we ignore this rdp.
|
|
|
|
*
|
|
|
|
* We check that without holding the nocb lock but
|
|
|
|
* we make sure not to miss a freshly offloaded rdp
|
|
|
|
* with the current ordering:
|
|
|
|
*
|
|
|
|
* rdp_offload_toggle() nocb_gp_enabled_cb()
|
|
|
|
* ------------------------- ----------------------------
|
|
|
|
* WRITE flags LOCK nocb_gp_lock
|
|
|
|
* LOCK nocb_gp_lock READ/WRITE nocb_gp_sleep
|
|
|
|
* READ/WRITE nocb_gp_sleep UNLOCK nocb_gp_lock
|
|
|
|
* UNLOCK nocb_gp_lock READ flags
|
|
|
|
*/
|
2020-11-13 13:13:21 +01:00
|
|
|
static inline bool nocb_gp_enabled_cb(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
u8 flags = SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_GP;
|
|
|
|
|
|
|
|
return rcu_segcblist_test_flags(&rdp->cblist, flags);
|
|
|
|
}
|
|
|
|
|
2021-01-28 18:12:13 +01:00
|
|
|
static inline bool nocb_gp_update_state_deoffloading(struct rcu_data *rdp,
|
|
|
|
bool *needwake_state)
|
2020-11-13 13:13:21 +01:00
|
|
|
{
|
|
|
|
struct rcu_segcblist *cblist = &rdp->cblist;
|
|
|
|
|
|
|
|
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) {
|
2020-11-13 13:13:22 +01:00
|
|
|
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_GP)) {
|
|
|
|
rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_GP);
|
|
|
|
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB))
|
|
|
|
*needwake_state = true;
|
|
|
|
}
|
2021-01-28 18:12:13 +01:00
|
|
|
return false;
|
2020-11-13 13:13:21 +01:00
|
|
|
}
|
2020-12-21 11:17:16 -08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* De-offloading. Clear our flag and notify the de-offload worker.
|
|
|
|
* We will ignore this rdp until it ever gets re-offloaded.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_GP));
|
|
|
|
rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_GP);
|
|
|
|
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB))
|
|
|
|
*needwake_state = true;
|
2021-01-28 18:12:13 +01:00
|
|
|
return true;
|
2020-11-13 13:13:21 +01:00
|
|
|
}
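
The flag handling in nocb_gp_update_state_deoffloading() above can be
exercised with a user-space model. This is a sketch only: the M_* bit
values and model_gp_update_state() are hypothetical, and a plain
unsigned int stands in for the flags word of the rcu_segcblist.

```c
#include <stdbool.h>

/* Hypothetical bit values; the kernel's actual values may differ. */
#define M_OFFLOADED	0x1u
#define M_KTHREAD_CB	0x2u
#define M_KTHREAD_GP	0x4u

/*
 * Model of the grace-period kthread's state update.  Returns true
 * when the rdp is being de-offloaded and should be skipped.
 * Sets *needwake_state when the (de-)offload worker needs a wakeup.
 */
static bool model_gp_update_state(unsigned int *flags, bool *needwake_state)
{
	if (*flags & M_OFFLOADED) {
		if (!(*flags & M_KTHREAD_GP)) {
			/* Freshly offloaded: acknowledge on the GP side. */
			*flags |= M_KTHREAD_GP;
			if (*flags & M_KTHREAD_CB)
				*needwake_state = true;
		}
		return false;
	}
	/* De-offloading: drop the GP flag, wake once CB side is done. */
	*flags &= ~M_KTHREAD_GP;
	if (!(*flags & M_KTHREAD_CB))
		*needwake_state = true;
	return true;
}
```

In this model the wakeup is requested by whichever of the two kthreads
is last to acknowledge the transition, which mirrors the structure of
the code above.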
|
|
|
|
|
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
/*
|
2019-05-15 09:56:40 -07:00
|
|
|
* No-CBs GP kthreads come here to wait for additional callbacks to show up
|
|
|
|
* or for grace periods to end.
|
2014-06-24 09:26:11 -07:00
|
|
|
*/
|
rcu/nocb: Provide separate no-CBs grace-period kthreads

Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.

This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.

One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.

This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcuog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuo callback-invocation kthreads on other CPUs.

This change does introduce additional kthreads, however:

1. The number of additional kthreads is about the square root of
   the number of CPUs, so that a 4096-CPU system would have only
   about 64 additional kthreads. Note that recent changes
   decreased the number of rcuo kthreads by a factor of two
   (CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
   this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
   scripting to affinity these additional kthreads as needed, the
   same as for the rcuop and rcuos kthreads. (There are no longer
   any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
   this would allow the rcuo kthreads to continue their dual
   leader/follower roles, it complicates callback invocation
   and makes it more difficult to consolidate rcuo callback
   invocation with existing softirq callback invocation.

The introduction of rcuog kthreads should thus be acceptable.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-29 16:43:51 -07:00
|
|
|
static void nocb_gp_wait(struct rcu_data *my_rdp)
|
2014-06-24 09:26:11 -07:00
|
|
|
{
|
2019-07-02 16:03:33 -07:00
|
|
|
bool bypass = false;
|
|
|
|
long bypass_ncbs;
|
2019-05-15 09:56:40 -07:00
|
|
|
int __maybe_unused cpu = my_rdp->cpu;
|
|
|
|
unsigned long cur_gp_seq;
|
2017-04-29 20:03:20 -07:00
|
|
|
unsigned long flags;
|
2019-09-23 17:26:34 +03:00
|
|
|
bool gotcbs = false;
|
2019-07-02 16:03:33 -07:00
|
|
|
unsigned long j = jiffies;
|
2019-05-22 09:35:11 -07:00
|
|
|
bool needwait_gp = false; // This prevents actual uninitialized use.
|
2019-05-15 09:56:40 -07:00
|
|
|
bool needwake;
|
|
|
|
bool needwake_gp;
|
2014-06-24 09:26:11 -07:00
|
|
|
struct rcu_data *rdp;
|
2019-05-15 09:56:40 -07:00
|
|
|
struct rcu_node *rnp;
|
2019-05-22 09:35:11 -07:00
|
|
|
unsigned long wait_gp_seq = 0; // Suppress "use uninitialized" warning.
|
2020-02-04 14:55:29 -08:00
|
|
|
bool wasempty = false;
|
2014-06-24 09:26:11 -07:00
|
|
|
|
|
|
|
/*
|
2019-05-15 09:56:40 -07:00
|
|
|
* Each pass through the following loop checks for CBs and for the
|
|
|
|
* nearest grace period (if any) to wait for next. The CB kthreads
|
|
|
|
* and the global grace-period kthread are awakened if needed.
|
2014-06-24 09:26:11 -07:00
|
|
|
*/
|
2020-08-05 10:35:16 -07:00
|
|
|
WARN_ON_ONCE(my_rdp->nocb_gp_rdp != my_rdp);
|
2019-03-28 15:33:59 -07:00
|
|
|
for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) {
|
2020-11-13 13:13:21 +01:00
|
|
|
bool needwake_state = false;
|
2020-12-21 11:17:16 -08:00
|
|
|
|
2020-11-13 13:13:21 +01:00
|
|
|
if (!nocb_gp_enabled_cb(rdp))
|
|
|
|
continue;
|
2019-07-02 16:03:33 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
|
|
|
|
rcu_nocb_lock_irqsave(rdp, flags);
|
2021-01-28 18:12:13 +01:00
|
|
|
if (nocb_gp_update_state_deoffloading(rdp, &needwake_state)) {
|
2020-11-13 13:13:21 +01:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
|
|
|
if (needwake_state)
|
|
|
|
swake_up_one(&rdp->nocb_state_wq);
|
|
|
|
continue;
|
|
|
|
}
|
2019-07-02 16:03:33 -07:00
|
|
|
bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
|
|
|
|
if (bypass_ncbs &&
|
|
|
|
(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
|
|
|
|
bypass_ncbs > 2 * qhimark)) {
|
|
|
|
// Bypass full or old, so flush it.
|
|
|
|
(void)rcu_nocb_try_flush_bypass(rdp, j);
|
|
|
|
bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
|
|
|
|
} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
|
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2020-11-13 13:13:22 +01:00
|
|
|
if (needwake_state)
|
|
|
|
swake_up_one(&rdp->nocb_state_wq);
|
2019-05-15 09:56:40 -07:00
|
|
|
continue; /* No callbacks here, try next. */
|
2019-07-02 16:03:33 -07:00
|
|
|
}
|
|
|
|
if (bypass_ncbs) {
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
|
|
|
TPS("Bypass"));
|
|
|
|
bypass = true;
|
|
|
|
}
|
2019-05-15 09:56:40 -07:00
|
|
|
rnp = rdp->mynode;
|
2019-07-02 16:03:33 -07:00
|
|
|
if (bypass) { // Avoid race with first bypass CB.
|
|
|
|
WRITE_ONCE(my_rdp->nocb_defer_wakeup,
|
|
|
|
RCU_NOCB_WAKE_NOT);
|
|
|
|
del_timer(&my_rdp->nocb_timer);
|
|
|
|
}
|
|
|
|
// Advance callbacks if helpful and low contention.
|
|
|
|
needwake_gp = false;
|
|
|
|
if (!rcu_segcblist_restempty(&rdp->cblist,
|
|
|
|
RCU_NEXT_READY_TAIL) ||
|
|
|
|
(rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
|
|
|
|
rcu_seq_done(&rnp->gp_seq, cur_gp_seq))) {
|
|
|
|
raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
|
|
|
|
needwake_gp = rcu_advance_cbs(rnp, rdp);
|
2020-02-04 14:55:29 -08:00
|
|
|
wasempty = rcu_segcblist_restempty(&rdp->cblist,
|
|
|
|
RCU_NEXT_READY_TAIL);
|
2019-07-02 16:03:33 -07:00
|
|
|
raw_spin_unlock_rcu_node(rnp); /* irqs disabled. */
|
|
|
|
}
|
2019-05-15 09:56:40 -07:00
|
|
|
// Need to wait on some grace period?
|
2020-02-04 14:55:29 -08:00
|
|
|
WARN_ON_ONCE(wasempty &&
|
|
|
|
!rcu_segcblist_restempty(&rdp->cblist,
|
2019-07-02 16:03:33 -07:00
|
|
|
RCU_NEXT_READY_TAIL));
|
2019-05-15 09:56:40 -07:00
|
|
|
if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq)) {
|
|
|
|
if (!needwait_gp ||
|
|
|
|
ULONG_CMP_LT(cur_gp_seq, wait_gp_seq))
|
|
|
|
wait_gp_seq = cur_gp_seq;
|
|
|
|
needwait_gp = true;
|
2019-07-02 16:03:33 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
|
|
|
|
TPS("NeedWaitGP"));
|
2017-04-29 20:03:20 -07:00
|
|
|
}
|
2019-05-15 09:56:40 -07:00
|
|
|
if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
|
|
|
|
needwake = rdp->nocb_cb_sleep;
|
|
|
|
WRITE_ONCE(rdp->nocb_cb_sleep, false);
|
|
|
|
smp_mb(); /* CB invocation -after- GP end. */
|
|
|
|
} else {
|
|
|
|
needwake = false;
|
2017-04-29 20:03:20 -07:00
|
|
|
}
|
2019-05-28 07:18:08 -07:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2019-05-15 09:56:40 -07:00
|
|
|
if (needwake) {
|
rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.
This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.
One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.
This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcuog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.
This change does introduce additional kthreads, however:
1. The number of additional kthreads is about the square root of
the number of CPUs, so that a 4096-CPU system would have only
about 64 additional kthreads. Note that recent changes
decreased the number of rcuo kthreads by a factor of two
(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
scripting to affinity these additional kthreads as needed, the
same as for the rcuop and rcuos kthreads. (There are no longer
any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
this would allow the rcuo kthreads to continue their dual
leader/follower roles, it complicates callback invocation
and makes it more difficult to consolidate rcuo callback
invocation with existing softirq callback invocation.
The introduction of rcuog kthreads should thus be acceptable.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-29 16:43:51 -07:00
|
|
|
swake_up_one(&rdp->nocb_cb_wq);
|
2019-05-15 09:56:40 -07:00
|
|
|
gotcbs = true;
|
rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things. This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.
To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers. By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders. In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.
For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period. This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-24 09:26:11 -07:00
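The numerical example in this commit message can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: `isqrt_floor()`, `struct nocb_wake_counts`, and `nocb_count_wakeups()` are hypothetical names, and the group size is assumed to be the integer square root of the CPU count. The kernel's actual stride computation, and any override via rcutree.rcu_nocb_leader_stride, may differ.

```c
#include <assert.h>

/* Hypothetical result type: wakeups at the end of one grace period. */
struct nocb_wake_counts {
	unsigned int leaders;              /* awakened by the grace-period kthread */
	unsigned int followers_per_leader; /* awakened by each leader (full group) */
};

/* Floor of the square root, by linear search; fine for CPU counts. */
static unsigned int isqrt_floor(unsigned int n)
{
	unsigned int r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Assumes a group size (stride) of floor(sqrt(ncpus)) CPUs. */
static struct nocb_wake_counts nocb_count_wakeups(unsigned int ncpus)
{
	struct nocb_wake_counts wc;
	unsigned int stride;

	assert(ncpus > 0);
	stride = isqrt_floor(ncpus);
	wc.leaders = (ncpus + stride - 1) / stride; /* round up for a partial group */
	wc.followers_per_leader = stride - 1;       /* the leader is not its own follower */
	return wc;
}
```

For 4096 CPUs this yields 64 leaders with 63 followers each, matching the figures quoted above. Under the same assumption, 80 CPUs yield 10 leaders, compared with the 79 wakeups the grace-period kthread would otherwise perform.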
|
|
|
}
|
2019-05-15 09:56:40 -07:00
|
|
|
if (needwake_gp)
|
|
|
|
rcu_gp_kthread_wake();
|
2020-11-13 13:13:22 +01:00
|
|
|
if (needwake_state)
|
|
|
|
swake_up_one(&rdp->nocb_state_wq);
|
2019-05-15 09:56:40 -07:00
|
|
|
}
|
|
|
|
|
2019-06-25 13:32:51 -07:00
|
|
|
my_rdp->nocb_gp_bypass = bypass;
|
|
|
|
my_rdp->nocb_gp_gp = needwait_gp;
|
|
|
|
my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
|
2019-07-02 16:03:33 -07:00
|
|
|
if (bypass && !rcu_nocb_poll) {
|
|
|
|
// At least one child with non-empty ->nocb_bypass, so set
|
|
|
|
// timer in order to avoid stranding its callbacks.
|
|
|
|
raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
|
|
|
|
mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
|
|
|
|
raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
|
|
|
|
}
|
2019-05-15 09:56:40 -07:00
|
|
|
if (rcu_nocb_poll) {
|
|
|
|
/* Polling, so trace if first poll in the series. */
|
|
|
|
if (gotcbs)
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Poll"));
|
2020-05-07 16:36:10 -07:00
|
|
|
schedule_timeout_idle(1);
|
2019-05-15 09:56:40 -07:00
|
|
|
} else if (!needwait_gp) {
|
|
|
|
/* Wait for callbacks to appear. */
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Sleep"));
|
|
|
|
swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq,
|
|
|
|
!READ_ONCE(my_rdp->nocb_gp_sleep));
|
2019-07-02 16:03:33 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("EndSleep"));
|
2019-05-15 09:56:40 -07:00
|
|
|
} else {
|
|
|
|
rnp = my_rdp->mynode;
|
|
|
|
trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("StartWait"));
|
|
|
|
swait_event_interruptible_exclusive(
|
|
|
|
rnp->nocb_gp_wq[rcu_seq_ctr(wait_gp_seq) & 0x1],
|
|
|
|
rcu_seq_done(&rnp->gp_seq, wait_gp_seq) ||
|
|
|
|
!READ_ONCE(my_rdp->nocb_gp_sleep));
|
|
|
|
trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("EndWait"));
|
|
|
|
}
|
|
|
|
if (!rcu_nocb_poll) {
|
2019-06-02 13:41:08 -07:00
|
|
|
raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
|
2019-07-02 16:03:33 -07:00
|
|
|
if (bypass)
|
|
|
|
del_timer(&my_rdp->nocb_bypass_timer);
|
2019-05-15 09:56:40 -07:00
|
|
|
WRITE_ONCE(my_rdp->nocb_gp_sleep, true);
|
2019-06-02 13:41:08 -07:00
|
|
|
raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
|
2014-06-24 09:26:11 -07:00
|
|
|
}
|
2019-06-25 13:32:51 -07:00
|
|
|
my_rdp->nocb_gp_seq = -1;
|
2019-05-15 09:56:40 -07:00
|
|
|
WARN_ON(signal_pending(current));
|
2019-03-29 16:43:51 -07:00
|
|
|
}
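The loop above tracks the earliest grace-period sequence number that any CPU in the group is waiting on, using `ULONG_CMP_LT()` rather than a plain `<` so that the choice stays correct when the counter wraps. A small user-space sketch of that selection follows. The macro is written the way the kernel's wraparound-safe comparison is understood to be defined, and `struct gp_wait` and `gp_wait_note()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if a precedes b in modular (wraparound-safe) unsigned long order. */
#define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/* Hypothetical holder for the two loop variables used above. */
struct gp_wait {
	bool needwait_gp;          /* has any CPU reported a needed grace period? */
	unsigned long wait_gp_seq; /* earliest sequence number reported so far */
};

/* Record that some CPU needs the grace period numbered cur_gp_seq. */
static void gp_wait_note(struct gp_wait *w, unsigned long cur_gp_seq)
{
	assert(w);
	if (!w->needwait_gp || ULONG_CMP_LT(cur_gp_seq, w->wait_gp_seq))
		w->wait_gp_seq = cur_gp_seq;
	w->needwait_gp = true;
}
```

Given a value just below `ULONG_MAX` and a small value just past the wrap, the sketch keeps the pre-wrap value as the earlier one, which a plain `<` comparison would get wrong.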
|
2014-06-24 09:26:11 -07:00
|
|
|
|
rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.
This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.
One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.
This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.
This change does introduce additional kthreads, however:
1. The number of additional kthreads is about the square root of
the number of CPUs, so that a 4096-CPU system would have only
about 64 additional kthreads. Note that recent changes
decreased the number of rcuo kthreads by a factor of two
(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
scripting to affinity these additional kthreads as needed, the
same as for the rcuop and rcuos kthreads. (There are no longer
any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
this would allow the rcuo kthreads to continue their dual
leader/follower roles, it complicates callback invocation
and makes it more difficult to consolidate rcuo callback
invocation with existing softirq callback invocation.
The introduction of rcuog kthreads should thus be acceptable.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
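The square-root sizing in item 1 above can be checked with a little arithmetic. The following is a minimal user-space sketch, not the kernel's code: the helper names isqrt(), nocb_group_stride() and nocb_group_count() are hypothetical, and the only assumption taken from the text is that the number of groups is about the square root of the number of CPUs.

```c
/* Integer square root by linear search; a stand-in for a kernel helper. */
static unsigned int isqrt(unsigned int n)
{
	unsigned int r = 0;

	while ((unsigned long long)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Hypothetical: CPUs per group when there are roughly sqrt(ncpus) groups. */
static unsigned int nocb_group_stride(unsigned int ncpus)
{
	unsigned int root = isqrt(ncpus);

	return root ? ncpus / root : 1;
}

/* Hypothetical: number of groups, i.e. of additional rcuog kthreads. */
static unsigned int nocb_group_count(unsigned int ncpus)
{
	unsigned int stride = nocb_group_stride(ncpus);

	return (ncpus + stride - 1) / stride;
}
```

With these assumptions, nocb_group_count(4096) evaluates to 64, matching the figure quoted in the commit message.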
2019-03-29 16:43:51 -07:00
|
|
|
/*
|
|
|
|
* No-CBs grace-period-wait kthread. There is one of these per group
|
|
|
|
* of CPUs, but only once at least one CPU in that group has come online
|
|
|
|
* at least once since boot. This kthread checks for newly posted
|
|
|
|
* callbacks from any of the CPUs it is responsible for, waits for a
|
|
|
|
* grace period, then awakens all of the rcu_nocb_cb_kthread() instances
|
|
|
|
* that then have callback-invocation work to do.
|
|
|
|
*/
|
|
|
|
static int rcu_nocb_gp_kthread(void *arg)
|
|
|
|
{
|
|
|
|
struct rcu_data *rdp = arg;
|
|
|
|
|
2019-05-15 09:56:40 -07:00
|
|
|
for (;;) {
|
2019-06-25 13:32:51 -07:00
|
|
|
WRITE_ONCE(rdp->nocb_gp_loops, rdp->nocb_gp_loops + 1);
|
rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.
This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.
One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.
This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcuog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.
This change does introduce additional kthreads, however:
1. The number of additional kthreads is about the square root of
the number of CPUs, so that a 4096-CPU system would have only
about 64 additional kthreads. Note that recent changes
decreased the number of rcuo kthreads by a factor of two
(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
scripting to affinity these additional kthreads as needed, the
same as for the rcuop and rcuos kthreads. (There are no longer
any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
this would allow the rcuo kthreads to continue their dual
leader/follower roles, it complicates callback invocation
and makes it more difficult to consolidate rcuo callback
invocation with existing softirq callback invocation.
The introduction of rcuog kthreads should thus be acceptable.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-29 16:43:51 -07:00
|
|
|
nocb_gp_wait(rdp);
|
2019-05-15 09:56:40 -07:00
|
|
|
cond_resched_tasks_rcu_qs();
|
|
|
|
}
|
2019-03-29 16:43:51 -07:00
|
|
|
return 0;
|
rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things. This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.
To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers. By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders. In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.
For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period. This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
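The numerical example above (64 leaders, each awakening 63 followers, on a 4096-CPU system) can be reproduced with the sketch below. It is an illustration only: struct nocb_wakeups, nocb_wakeups_for() and nocb_leader_of() are hypothetical names, and the mapping that makes the lowest-numbered CPU of each stride-sized block the leader is an assumption, not something the commit message states.

```c
/* Hypothetical summary of who wakes whom under the leader/follower split. */
struct nocb_wakeups {
	unsigned int leaders;                  /* awakened by the grace-period kthread */
	unsigned int max_followers_per_leader; /* awakened by each leader, at most */
};

static struct nocb_wakeups nocb_wakeups_for(unsigned int ncpus,
					    unsigned int stride)
{
	struct nocb_wakeups w = { 0, 0 };

	if (ncpus == 0 || stride == 0)
		return w;
	/* One leader per group of "stride" CPUs, rounding up for a partial group. */
	w.leaders = (ncpus + stride - 1) / stride;
	/* A full group has one leader; the rest are its followers. */
	w.max_followers_per_leader = (stride < ncpus ? stride : ncpus) - 1;
	return w;
}

/* Assumed mapping: the first CPU of each stride-sized block leads that block. */
static unsigned int nocb_leader_of(unsigned int cpu, unsigned int stride)
{
	return stride ? (cpu / stride) * stride : cpu;
}
```

For 4096 CPUs and a stride of 64, the grace-period kthread performs 64 wakeups per grace period and each leader performs at most 63, so no single kthread is responsible for thousands of wakeups.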
2014-06-24 09:26:11 -07:00
|
|
|
}
|
|
|
|
|
2020-11-13 13:13:19 +01:00
|
|
|
static inline bool nocb_cb_can_run(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
u8 flags = SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_CB;
|
|
|
|
return rcu_segcblist_test_flags(&rdp->cblist, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool nocb_cb_wait_cond(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
return nocb_cb_can_run(rdp) && !READ_ONCE(rdp->nocb_cb_sleep);
|
|
|
|
}
|
|
|
|
|
2014-06-24 09:26:11 -07:00
|
|
|
/*
|
2019-05-15 09:56:40 -07:00
|
|
|
* Invoke any ready callbacks from the corresponding no-CBs CPU,
|
|
|
|
* then, if there are no more, wait for more to appear.
|
2014-06-24 09:26:11 -07:00
|
|
|
*/
|
2019-05-15 09:56:40 -07:00
|
|
|
static void nocb_cb_wait(struct rcu_data *rdp)
|
2014-06-24 09:26:11 -07:00
|
|
|
{
|
2020-11-13 13:13:19 +01:00
|
|
|
struct rcu_segcblist *cblist = &rdp->cblist;
|
2019-07-15 01:09:04 -07:00
|
|
|
unsigned long cur_gp_seq;
|
2019-05-15 09:56:40 -07:00
|
|
|
unsigned long flags;
|
2020-12-21 11:17:16 -08:00
|
|
|
bool needwake_state = false;
|
2019-05-15 09:56:40 -07:00
|
|
|
bool needwake_gp = false;
|
2021-01-28 18:12:12 +01:00
|
|
|
bool can_sleep = true;
|
2019-05-15 09:56:40 -07:00
|
|
|
struct rcu_node *rnp = rdp->mynode;
|
|
|
|
|
|
|
|
local_irq_save(flags);
|
|
|
|
rcu_momentary_dyntick_idle();
|
|
|
|
local_irq_restore(flags);
|
2021-01-28 18:12:08 +01:00
|
|
|
/*
|
|
|
|
* Disable BH to provide the expected environment. Also, when
|
|
|
|
* transitioning to/from NOCB mode, a self-requeuing callback might
|
|
|
|
* be invoked from softirq. A short grace period could cause both
|
|
|
|
* instances of this callback to execute concurrently.
|
|
|
|
*/
|
2019-05-15 09:56:40 -07:00
|
|
|
local_bh_disable();
|
|
|
|
rcu_do_batch(rdp);
|
|
|
|
local_bh_enable();
|
|
|
|
lockdep_assert_irqs_enabled();
|
2019-05-28 07:18:08 -07:00
|
|
|
rcu_nocb_lock_irqsave(rdp, flags);
|
2020-11-13 13:13:19 +01:00
|
|
|
if (rcu_segcblist_nextgp(cblist, &cur_gp_seq) &&
|
2019-07-15 01:09:04 -07:00
|
|
|
rcu_seq_done(&rnp->gp_seq, cur_gp_seq) &&
|
|
|
|
raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
|
2019-06-01 13:33:55 -07:00
|
|
|
needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
|
|
|
|
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
|
|
|
|
}
|
2019-05-15 09:56:40 -07:00
|
|
|
|
2020-11-13 13:13:19 +01:00
|
|
|
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) {
|
2020-11-13 13:13:22 +01:00
|
|
|
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) {
|
|
|
|
rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_CB);
|
|
|
|
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_GP))
|
|
|
|
needwake_state = true;
|
|
|
|
}
|
2020-11-13 13:13:19 +01:00
|
|
|
if (rcu_segcblist_ready_cbs(cblist))
|
2021-01-28 18:12:12 +01:00
|
|
|
can_sleep = false;
|
2020-11-13 13:13:19 +01:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* De-offloading. Clear our flag and notify the de-offload worker.
|
|
|
|
* We won't touch the callbacks and keep sleeping until we
|
|
|
|
* get re-offloaded.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB));
|
|
|
|
rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_CB);
|
|
|
|
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_GP))
|
|
|
|
needwake_state = true;
|
|
|
|
}
|
|
|
|
|
2021-01-28 18:12:12 +01:00
|
|
|
WRITE_ONCE(rdp->nocb_cb_sleep, can_sleep);
|
|
|
|
|
2020-11-13 13:13:19 +01:00
|
|
|
if (rdp->nocb_cb_sleep)
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep"));
|
|
|
|
|
2019-05-28 07:18:08 -07:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2019-05-15 09:56:40 -07:00
|
|
|
if (needwake_gp)
|
|
|
|
rcu_gp_kthread_wake();
|
2020-11-13 13:13:19 +01:00
|
|
|
|
|
|
|
if (needwake_state)
|
|
|
|
swake_up_one(&rdp->nocb_state_wq);
|
|
|
|
|
|
|
|
do {
|
|
|
|
swait_event_interruptible_exclusive(rdp->nocb_cb_wq,
|
|
|
|
nocb_cb_wait_cond(rdp));
|
|
|
|
|
2020-12-21 11:17:16 -08:00
|
|
|
// VVV Ensure CB invocation follows _sleep test.
|
|
|
|
if (smp_load_acquire(&rdp->nocb_cb_sleep)) { // ^^^
|
2020-11-13 13:13:19 +01:00
|
|
|
WARN_ON(signal_pending(current));
|
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty"));
|
|
|
|
}
|
|
|
|
} while (!nocb_cb_can_run(rdp));
|
2014-06-24 09:26:11 -07:00
|
|
|
}
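The offload/de-offload acknowledgment that nocb_cb_wait() performs on the SEGCBLIST_* flags can be isolated as a small user-space sketch. This is an illustration under stated assumptions: the SKETCH_* bit values and the sketch_cb_ack() name are hypothetical, locking and the ready-callback check are omitted, and only the flag logic visible in the function above is modelled.

```c
#include <stdbool.h>

/* Illustrative bit values; the kernel's SEGCBLIST_* constants may differ. */
#define SKETCH_OFFLOADED  0x1u
#define SKETCH_KTHREAD_CB 0x2u
#define SKETCH_KTHREAD_GP 0x4u

/*
 * Update the callback kthread's acknowledgment bit and report whether the
 * task waiting on the state transition should be awakened.
 */
static bool sketch_cb_ack(unsigned int *flags)
{
	bool needwake_state = false;

	if (*flags & SKETCH_OFFLOADED) {
		if (!(*flags & SKETCH_KTHREAD_CB)) {
			*flags |= SKETCH_KTHREAD_CB;
			/* Offloading is complete once the GP kthread has acknowledged too. */
			if (*flags & SKETCH_KTHREAD_GP)
				needwake_state = true;
		}
	} else {
		*flags &= ~SKETCH_KTHREAD_CB;
		/* De-offloading is complete once the GP kthread has dropped its bit too. */
		if (!(*flags & SKETCH_KTHREAD_GP))
			needwake_state = true;
	}
	return needwake_state;
}
```

In both directions the wakeup is issued by whichever kthread acknowledges last, which is why the callback kthread tests the grace-period kthread's bit before setting needwake_state.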
|
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
/*
|
2019-05-15 09:56:40 -07:00
|
|
|
* Per-rcu_data kthread, but only for no-CBs CPUs. Repeatedly invoke
|
|
|
|
* nocb_cb_wait() to do the dirty work.
|
2012-08-19 21:35:53 -07:00
|
|
|
*/
|
2019-03-29 16:43:51 -07:00
|
|
|
static int rcu_nocb_cb_kthread(void *arg)
|
2012-08-19 21:35:53 -07:00
|
|
|
{
|
|
|
|
struct rcu_data *rdp = arg;
|
|
|
|
|
2019-05-15 09:56:40 -07:00
|
|
|
// Each pass through this loop does one callback batch, and,
|
|
|
|
// if there are no more ready callbacks, waits for them.
|
2012-08-19 21:35:53 -07:00
|
|
|
for (;;) {
|
2019-05-15 09:56:40 -07:00
|
|
|
nocb_cb_wait(rdp);
|
|
|
|
cond_resched_tasks_rcu_qs();
|
2012-08-19 21:35:53 -07:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&ctx->lock);
> lock(&rq->lock);
> lock(&ctx->lock);
> lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.
One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:
1. Determine when it is likely that a relevant scheduler lock is held.
2. Defer the wakeup in such cases.
3. Ensure that all deferred wakeups eventually happen, preferably
sooner rather than later.
We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.
The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.
This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
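The three-step deferral described above (detect a likely-held scheduler lock, defer the wakeup, make sure it eventually happens) can be sketched in user space as follows. This is a simplified model, not the kernel implementation: struct sketch_rdp and the sketch_* functions are hypothetical, the defer_wakeup field stands in for ->nocb_defer_wakeup, and the caller passes the interrupts-disabled state explicitly instead of the code calling irqs_disabled_flags().

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; defer_wakeup models ->nocb_defer_wakeup. */
struct sketch_rdp {
	bool defer_wakeup;    /* a wakeup is owed to the rcuo kthread */
	unsigned int wakeups; /* wakeups actually delivered */
};

/* Stand-in for waking the callback-offload kthread. */
static void sketch_wake_kthread(struct sketch_rdp *rdp)
{
	rdp->wakeups++;
}

/*
 * Enqueue path: interrupts being disabled is used as a proxy for a
 * scheduler lock possibly being held, so the wakeup is only recorded.
 */
static void sketch_enqueue(struct sketch_rdp *rdp, bool irqs_disabled)
{
	if (irqs_disabled) {
		rdp->defer_wakeup = true;
		return;
	}
	sketch_wake_kthread(rdp);
}

/*
 * Later safe point (RCU core processing or entry to an RCU-idle state in
 * the commit's description): deliver any wakeup that was deferred.
 */
static void sketch_do_deferred_wakeup(struct sketch_rdp *rdp)
{
	if (!rdp->defer_wakeup)
		return;
	rdp->defer_wakeup = false;
	sketch_wake_kthread(rdp);
}
```

Deferring whenever interrupts are disabled may postpone more wakeups than strictly necessary, but, as the commit message notes, that is at least safe.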
2013-10-04 14:33:34 -07:00
|
|
|
/* Is a deferred wakeup of rcu_nocb_kthread() required? */
|
2014-07-29 14:50:47 -07:00
|
|
|
static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
|
rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&ctx->lock);
>                                lock(&rq->lock);
>                                lock(&ctx->lock);
>   lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.
One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:
1. Determine when it is likely that a relevant scheduler lock is held.
2. Defer the wakeup in such cases.
3. Ensure that all deferred wakeups eventually happen, preferably
sooner rather than later.
We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.
The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.
This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
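The three-step plan in the commit message above (detect, defer, eventually flush) can be sketched as a small userspace model. This is only an illustration, not the kernel code: `struct cpu_model`, `model_enqueue()`, `model_flush_deferred()` and the `WAKE_*` names are hypothetical, and the `irqs_disabled` field merely stands in for the `irqs_disabled_flags()` proxy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the deferral states. */
enum wake_state { WAKE_NOT = 0, WAKE = 1, WAKE_FORCE = 2 };

struct cpu_model {
	bool irqs_disabled;           /* proxy: a scheduler lock may be held */
	enum wake_state defer_wakeup; /* models ->nocb_defer_wakeup */
	int wakeups_done;             /* wakeups actually issued */
};

/* Enqueue path: wake immediately only when that is presumed safe. */
static void model_enqueue(struct cpu_model *c)
{
	if (c->irqs_disabled) {
		/* Step 2: record the request instead of entering the scheduler. */
		if (c->defer_wakeup < WAKE)
			c->defer_wakeup = WAKE;
		return;
	}
	c->wakeups_done++;
}

/*
 * Step 3: called from a point that is assumed to hold no scheduler
 * locks (in the commit: RCU core processing and idle entry).
 * Returns true if a deferred wakeup was actually performed.
 */
static bool model_flush_deferred(struct cpu_model *c)
{
	if (c->defer_wakeup == WAKE_NOT)
		return false;
	c->defer_wakeup = WAKE_NOT;
	c->wakeups_done++;
	return true;
}
```

Deferring whenever interrupts are disabled is conservative: the model may postpone a wakeup that would have been safe, but it never issues one while the proxy says a scheduler lock might be held.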
2013-10-04 14:33:34 -07:00
|
|
|
{
|
2020-11-13 13:13:23 +01:00
|
|
|
return READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT;
|
2013-10-04 14:33:34 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Do a deferred wakeup of rcu_nocb_kthread(). */
|
2021-02-01 00:05:46 +01:00
|
|
|
static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
|
2013-10-04 14:33:34 -07:00
|
|
|
{
|
2017-04-29 20:03:20 -07:00
|
|
|
unsigned long flags;
|
2014-07-29 14:50:47 -07:00
|
|
|
int ndw;
|
2021-02-01 00:05:46 +01:00
|
|
|
int ret;
|
2014-07-29 14:50:47 -07:00
|
|
|
|
2019-05-28 07:18:08 -07:00
|
|
|
rcu_nocb_lock_irqsave(rdp, flags);
|
2017-04-29 20:03:20 -07:00
|
|
|
if (!rcu_nocb_need_deferred_wakeup(rdp)) {
|
2019-05-28 07:18:08 -07:00
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
2021-02-01 00:05:46 +01:00
|
|
|
return false;
|
2017-04-29 20:03:20 -07:00
|
|
|
}
|
2015-03-03 14:57:58 -08:00
|
|
|
ndw = READ_ONCE(rdp->nocb_defer_wakeup);
|
2017-04-28 17:04:09 -07:00
|
|
|
WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
|
2021-02-01 00:05:46 +01:00
|
|
|
ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
|
2018-07-04 14:45:00 -07:00
|
|
|
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
|
2021-02-01 00:05:46 +01:00
|
|
|
|
|
|
|
return ret;
|
2013-10-04 14:33:34 -07:00
|
|
|
}
|
|
|
|
|
2017-04-29 20:03:20 -07:00
|
|
|
/* Do a deferred wakeup of rcu_nocb_kthread() from a timer handler. */
|
2017-10-22 17:58:54 -07:00
|
|
|
static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
|
2017-04-29 20:03:20 -07:00
|
|
|
{
|
2017-10-22 17:58:54 -07:00
|
|
|
struct rcu_data *rdp = from_timer(rdp, t, nocb_timer);
|
|
|
|
|
|
|
|
do_nocb_deferred_wakeup_common(rdp);
|
2017-04-29 20:03:20 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do a deferred wakeup of rcu_nocb_kthread() from fastpath.
|
|
|
|
* This means we do an inexact common-case check. Note that if
|
|
|
|
* we miss, ->nocb_timer will eventually clean things up.
|
|
|
|
*/
|
2021-02-01 00:05:46 +01:00
|
|
|
static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
|
2017-04-29 20:03:20 -07:00
|
|
|
{
|
|
|
|
if (rcu_nocb_need_deferred_wakeup(rdp))
|
2021-02-01 00:05:46 +01:00
|
|
|
return do_nocb_deferred_wakeup_common(rdp);
|
|
|
|
return false;
|
2017-04-29 20:03:20 -07:00
|
|
|
}
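The split between do_nocb_deferred_wakeup() and do_nocb_deferred_wakeup_common() follows a check-then-recheck pattern: an inexact unlocked test on the fast path, then an authoritative test under the lock. A minimal userspace sketch of that pattern follows, assuming C11 atomics in place of READ_ONCE()/WRITE_ONCE() and a toy spinlock in place of ->nocb_lock. All `model_*` names and the `rdp_model` layout are hypothetical. In the kernel the wakeup helper also releases the lock itself, which this sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { WAKE_NOT = 0, WAKE = 1, WAKE_FORCE = 2 };

struct rdp_model {
	atomic_flag lock;        /* stands in for ->nocb_lock */
	atomic_int defer_wakeup; /* stands in for ->nocb_defer_wakeup */
	int wakeups;             /* protected by lock */
	int forced;              /* protected by lock */
};

static void model_init(struct rdp_model *r)
{
	atomic_flag_clear(&r->lock);
	atomic_init(&r->defer_wakeup, WAKE_NOT);
	r->wakeups = 0;
	r->forced = 0;
}

static void model_lock(struct rdp_model *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void model_unlock(struct rdp_model *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Inexact check: may race with a concurrent update of the field. */
static bool model_need_deferred_wakeup(struct rdp_model *r)
{
	return atomic_load_explicit(&r->defer_wakeup,
				    memory_order_relaxed) > WAKE_NOT;
}

/* Slow path: recheck under the lock, then consume the request. */
static bool model_deferred_wakeup_common(struct rdp_model *r)
{
	int ndw;

	model_lock(r);
	if (!model_need_deferred_wakeup(r)) {
		model_unlock(r);
		return false;
	}
	ndw = atomic_load_explicit(&r->defer_wakeup, memory_order_relaxed);
	atomic_store_explicit(&r->defer_wakeup, WAKE_NOT, memory_order_relaxed);
	r->wakeups++;
	if (ndw == WAKE_FORCE)
		r->forced++;
	model_unlock(r);
	return true;
}

/* Fast path: skip the lock entirely in the common no-request case. */
static bool model_deferred_wakeup(struct rdp_model *r)
{
	if (model_need_deferred_wakeup(r))
		return model_deferred_wakeup_common(r);
	return false;
}
```

If the unlocked check misses a request that arrives just afterwards, nothing is lost in the real code because ->nocb_timer eventually runs the common path, as the comment above the function notes.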
|
|
|
|
|
2021-02-01 00:05:45 +01:00
|
|
|
void rcu_nocb_flush_deferred_wakeup(void)
|
|
|
|
{
|
|
|
|
do_nocb_deferred_wakeup(this_cpu_ptr(&rcu_data));
|
2017-04-29 20:03:20 -07:00
|
|
|
}
|
2021-02-01 00:05:48 +01:00
|
|
|
EXPORT_SYMBOL_GPL(rcu_nocb_flush_deferred_wakeup);
|
2017-04-29 20:03:20 -07:00
|
|
|
|
2020-11-13 13:13:22 +01:00
|
|
|
static int rdp_offload_toggle(struct rcu_data *rdp,
|
|
|
|
bool offload, unsigned long flags)
|
|
|
|
__releases(rdp->nocb_lock)
|
2020-11-13 13:13:19 +01:00
|
|
|
{
|
|
|
|
struct rcu_segcblist *cblist = &rdp->cblist;
|
2020-11-13 13:13:21 +01:00
|
|
|
struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
|
2020-11-13 13:13:22 +01:00
|
|
|
bool wake_gp = false;
|
2020-11-13 13:13:20 +01:00
|
|
|
|
2020-11-13 13:13:22 +01:00
|
|
|
rcu_segcblist_offload(cblist, offload);
|
2020-11-13 13:13:19 +01:00
|
|
|
|
2020-11-13 13:13:22 +01:00
|
|
|
if (rdp->nocb_cb_sleep)
|
2020-11-13 13:13:19 +01:00
|
|
|
rdp->nocb_cb_sleep = false;
|
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
|
|
|
|
2020-11-13 13:13:22 +01:00
|
|
|
/*
|
|
|
|
* Ignore former value of nocb_cb_sleep and force wake up as it could
|
|
|
|
* have been spuriously set to false already.
|
|
|
|
*/
|
|
|
|
swake_up_one(&rdp->nocb_cb_wq);
|
2020-11-13 13:13:19 +01:00
|
|
|
|
2020-11-13 13:13:21 +01:00
|
|
|
raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
|
|
|
|
if (rdp_gp->nocb_gp_sleep) {
|
|
|
|
rdp_gp->nocb_gp_sleep = false;
|
|
|
|
wake_gp = true;
|
|
|
|
}
|
|
|
|
raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
|
2020-11-13 13:13:19 +01:00
|
|
|
|
2020-11-13 13:13:21 +01:00
|
|
|
if (wake_gp)
|
|
|
|
wake_up_process(rdp_gp->nocb_gp_kthread);
|
|
|
|
|
2020-11-13 13:13:22 +01:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-01-28 18:12:09 +01:00
|
|
|
static long rcu_nocb_rdp_deoffload(void *arg)
|
2020-11-13 13:13:22 +01:00
|
|
|
{
|
2021-01-28 18:12:09 +01:00
|
|
|
struct rcu_data *rdp = arg;
|
2020-11-13 13:13:22 +01:00
|
|
|
struct rcu_segcblist *cblist = &rdp->cblist;
|
|
|
|
unsigned long flags;
|
|
|
|
int ret;
|
|
|
|
|
2021-01-28 18:12:09 +01:00
|
|
|
WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
|
|
|
|
|
2020-12-21 11:17:16 -08:00
|
|
|
pr_info("De-offloading %d\n", rdp->cpu);
|
2020-11-13 13:13:22 +01:00
|
|
|
|
|
|
|
rcu_nocb_lock_irqsave(rdp, flags);
|
|
|
|
|
|
|
|
ret = rdp_offload_toggle(rdp, false, flags);
|
2020-11-13 13:13:21 +01:00
|
|
|
swait_event_exclusive(rdp->nocb_state_wq,
|
|
|
|
!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB |
|
|
|
|
SEGCBLIST_KTHREAD_GP));
|
2020-11-13 13:13:23 +01:00
|
|
|
rcu_nocb_lock_irqsave(rdp, flags);
|
2020-11-13 13:13:24 +01:00
|
|
|
/* Make sure nocb timer won't stay around */
|
2020-11-13 13:13:23 +01:00
|
|
|
WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_OFF);
|
|
|
|
rcu_nocb_unlock_irqrestore(rdp, flags);
|
|
|
|
del_timer_sync(&rdp->nocb_timer);
|
|
|
|
|
2020-11-13 13:13:24 +01:00
|
|
|
/*
|
|
|
|
* Flush bypass. While IRQs are disabled and once we set
|
|
|
|
* SEGCBLIST_SOFTIRQ_ONLY, no callback is supposed to be
|
|
|
|
* enqueued on bypass.
|
|
|
|
*/
|
|
|
|
rcu_nocb_lock_irqsave(rdp, flags);
|
|
|
|
rcu_nocb_flush_bypass(rdp, NULL, jiffies);
|
2020-11-13 13:13:25 +01:00
|
|
|
rcu_segcblist_set_flags(cblist, SEGCBLIST_SOFTIRQ_ONLY);
|
|
|
|
/*
|
|
|
|
* With SEGCBLIST_SOFTIRQ_ONLY, we can't use
|
|
|
|
* rcu_nocb_unlock_irqrestore() anymore. Theoretically we
|
|
|
|
* could set SEGCBLIST_SOFTIRQ_ONLY with cb unlocked and IRQs
|
|
|
|
* disabled now, but let's be paranoid.
|
|
|
|
*/
|
|
|
|
raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
|
2020-11-13 13:13:24 +01:00
|
|
|
|
2020-11-13 13:13:22 +01:00
|
|
|
return ret;
|
2020-11-13 13:13:19 +01:00
|
|
|
}
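The wait condition in rcu_nocb_rdp_deoffload() lets the de-offload continue only once both kthread flags have been cleared. A small sketch of that mask test follows, assuming the flag-test helper returns true when any bit of the mask is set. The `M_*` bit values and `model_*` names are hypothetical and are not the kernel's SEGCBLIST_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this illustration. */
#define M_OFFLOADED    (1u << 0)
#define M_KTHREAD_CB   (1u << 1)
#define M_KTHREAD_GP   (1u << 2)
#define M_SOFTIRQ_ONLY (1u << 3)

struct cblist_model {
	unsigned int flags;
};

/* True if at least one bit of mask is set in the list's flags. */
static bool model_test_flags(const struct cblist_model *l, unsigned int mask)
{
	return (l->flags & mask) != 0;
}

/* The de-offload may proceed only when neither kthread flag remains. */
static bool model_deoffload_may_proceed(const struct cblist_model *l)
{
	return !model_test_flags(l, M_KTHREAD_CB | M_KTHREAD_GP);
}
```

Because the test is an any-bit test, negating it yields "all of these bits are clear", which is why a single wait covers both the callback kthread and the grace-period kthread.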
|
|
|
|
|
|
|
|
int rcu_nocb_cpu_deoffload(int cpu)
|
|
|
|
{
|
|
|
|
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (rdp == rdp->nocb_gp_rdp) {
|
|
|
|
pr_info("Can't deoffload an rdp GP leader (yet)\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
mutex_lock(&rcu_state.barrier_mutex);
|
|
|
|
cpus_read_lock();
|
2020-11-12 01:51:21 +01:00
|
|
|
if (rcu_rdp_is_offloaded(rdp)) {
|
2021-01-28 18:12:09 +01:00
|
|
|
if (cpu_online(cpu)) {
|
2020-11-13 13:13:19 +01:00
|
|
|
ret = work_on_cpu(cpu, rcu_nocb_rdp_deoffload, rdp);
|
2021-01-28 18:12:09 +01:00
|
|
|
if (!ret)
|
|
|
|
cpumask_clear_cpu(cpu, rcu_nocb_mask);
|
|
|
|
} else {
|
|
|
|
pr_info("NOCB: Can't CB-deoffload an offline CPU\n");
|
|
|
|
ret = -EINVAL;
|
|
|
|
}
|
2020-11-13 13:13:19 +01:00
|
|
|
}
|
|
|
|
cpus_read_unlock();
|
|
|
|
mutex_unlock(&rcu_state.barrier_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rcu_nocb_cpu_deoffload);
|
|
|
|
|
2021-01-28 18:12:09 +01:00
|
|
|
static long rcu_nocb_rdp_offload(void *arg)
|
2020-11-13 13:13:22 +01:00
|
|
|
{
|
2021-01-28 18:12:09 +01:00
|
|
|
struct rcu_data *rdp = arg;
|
2020-11-13 13:13:22 +01:00
|
|
|
struct rcu_segcblist *cblist = &rdp->cblist;
|
|
|
|
unsigned long flags;
|
|
|
|
int ret;
|
|
|
|
|
2021-01-28 18:12:09 +01:00
|
|
|
WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
|
2020-11-13 13:13:22 +01:00
|
|
|
/*
|
|
|
|
* For now we only support re-offload, i.e., the rdp must have been
|
|
|
|
* offloaded on boot first.
|
|
|
|
*/
|
|
|
|
if (!rdp->nocb_gp_rdp)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2020-12-21 11:17:16 -08:00
|
|
|
pr_info("Offloading %d\n", rdp->cpu);
|
2020-11-13 13:13:22 +01:00
|
|
|
/*
|
|
|
|
* Can't use rcu_nocb_lock_irqsave() while we are in
|
|
|
|
* SEGCBLIST_SOFTIRQ_ONLY mode.
|
|
|
|
*/
|
|
|
|
raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
|
2020-11-13 13:13:23 +01:00
|
|
|
/* Re-enable nocb timer */
|
|
|
|
WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
|
2020-11-13 13:13:22 +01:00
|
|
|
/*
|
|
|
|
* We didn't take the nocb lock while working on the
|
|
|
|
* rdp->cblist in SEGCBLIST_SOFTIRQ_ONLY mode.
|
|
|
|
* All modifications previously made to
|
|
|
|
* rdp->cblist must be visible remotely by the nocb kthreads
|
|
|
|
* upon wake up after reading the cblist flags.
|
|
|
|
*
|
|
|
|
* The layout against nocb_lock enforces that ordering:
|
|
|
|
*
|
|
|
|
*  rcu_nocb_rdp_offload()      nocb_cb_wait()/nocb_gp_wait()
|
|
|
|
*  -------------------------   ----------------------------
|
|
|
|
*      WRITE callbacks           rcu_nocb_lock()
|
|
|
|
*      rcu_nocb_lock()           READ flags
|
|
|
|
*      WRITE flags               READ callbacks
|
|
|
|
*      rcu_nocb_unlock()         rcu_nocb_unlock()
|
|
|
|
*/
|
|
|
|
ret = rdp_offload_toggle(rdp, true, flags);
|
|
|
|
swait_event_exclusive(rdp->nocb_state_wq,
|
|
|
|
rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB) &&
|
|
|
|
rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_GP));
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int rcu_nocb_cpu_offload(int cpu)
|
|
|
|
{
|
|
|
|
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
mutex_lock(&rcu_state.barrier_mutex);
|
|
|
|
cpus_read_lock();
|
2020-11-12 01:51:21 +01:00
|
|
|
if (!rcu_rdp_is_offloaded(rdp)) {
|
2021-01-28 18:12:09 +01:00
|
|
|
if (cpu_online(cpu)) {
|
2020-11-13 13:13:22 +01:00
|
|
|
ret = work_on_cpu(cpu, rcu_nocb_rdp_offload, rdp);
|
2021-01-28 18:12:09 +01:00
|
|
|
if (!ret)
|
|
|
|
cpumask_set_cpu(cpu, rcu_nocb_mask);
|
|
|
|
} else {
|
|
|
|
pr_info("NOCB: Can't CB-offload an offline CPU\n");
|
|
|
|
ret = -EINVAL;
|
|
|
|
}
|
2020-11-13 13:13:22 +01:00
|
|
|
}
|
|
|
|
cpus_read_unlock();
|
|
|
|
mutex_unlock(&rcu_state.barrier_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);
|
|
|
|
|
2014-07-25 11:21:47 -07:00
|
|
|
void __init rcu_init_nohz(void)
|
|
|
|
{
|
|
|
|
int cpu;
|
2018-02-28 10:34:54 -08:00
|
|
|
bool need_rcu_nocb_mask = false;
|
2019-05-14 09:50:49 -07:00
|
|
|
struct rcu_data *rdp;
|
2014-07-25 11:21:47 -07:00
|
|
|
|
|
|
|
#if defined(CONFIG_NO_HZ_FULL)
|
|
|
|
if (tick_nohz_full_running && cpumask_weight(tick_nohz_full_mask))
|
|
|
|
need_rcu_nocb_mask = true;
|
|
|
|
#endif /* #if defined(CONFIG_NO_HZ_FULL) */
|
|
|
|
|
2017-11-17 21:40:15 +06:00
|
|
|
if (!cpumask_available(rcu_nocb_mask) && need_rcu_nocb_mask) {
|
2014-07-25 16:02:07 -07:00
|
|
|
if (!zalloc_cpumask_var(&rcu_nocb_mask, GFP_KERNEL)) {
|
|
|
|
pr_info("rcu_nocb_mask allocation failed, callback offloading disabled.\n");
|
|
|
|
return;
|
|
|
|
}
|
2014-07-25 11:21:47 -07:00
|
|
|
}
|
2017-11-17 21:40:15 +06:00
|
|
|
if (!cpumask_available(rcu_nocb_mask))
|
2014-07-25 11:21:47 -07:00
|
|
|
return;
|
|
|
|
|
|
|
|
#if defined(CONFIG_NO_HZ_FULL)
|
|
|
|
if (tick_nohz_full_running)
|
|
|
|
cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
|
|
|
|
#endif /* #if defined(CONFIG_NO_HZ_FULL) */
|
|
|
|
|
|
|
|
if (!cpumask_subset(rcu_nocb_mask, cpu_possible_mask)) {
|
2018-02-28 10:34:54 -08:00
|
|
|
pr_info("\tNote: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs.\n");
|
2014-07-25 11:21:47 -07:00
|
|
|
cpumask_and(rcu_nocb_mask, cpu_possible_mask,
|
|
|
|
rcu_nocb_mask);
|
|
|
|
}
|
2017-12-04 09:48:59 -08:00
|
|
|
if (cpumask_empty(rcu_nocb_mask))
|
|
|
|
pr_info("\tOffload RCU callbacks from CPUs: (none).\n");
|
|
|
|
else
|
|
|
|
pr_info("\tOffload RCU callbacks from CPUs: %*pbl.\n",
|
|
|
|
cpumask_pr_args(rcu_nocb_mask));
|
2014-07-25 11:21:47 -07:00
|
|
|
if (rcu_nocb_poll)
|
|
|
|
pr_info("\tPoll for callbacks from no-CBs CPUs.\n");
|
|
|
|
|
2019-05-14 09:50:49 -07:00
|
|
|
for_each_cpu(cpu, rcu_nocb_mask) {
|
|
|
|
rdp = per_cpu_ptr(&rcu_data, cpu);
|
|
|
|
if (rcu_segcblist_empty(&rdp->cblist))
|
|
|
|
rcu_segcblist_init(&rdp->cblist);
|
2020-11-13 13:13:19 +01:00
|
|
|
rcu_segcblist_offload(&rdp->cblist, true);
|
|
|
|
rcu_segcblist_set_flags(&rdp->cblist, SEGCBLIST_KTHREAD_CB);
|
2020-11-13 13:13:21 +01:00
|
|
|
rcu_segcblist_set_flags(&rdp->cblist, SEGCBLIST_KTHREAD_GP);
|
2019-05-14 09:50:49 -07:00
|
|
|
}
|
2018-07-04 15:35:00 -07:00
|
|
|
rcu_organize_nocb_kthreads();
|
rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&ctx->lock);
> lock(&rq->lock);
> lock(&ctx->lock);
> lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.
One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:
1. Determine when it is likely that a relevant scheduler lock is held.
2. Defer the wakeup in such cases.
3. Ensure that all deferred wakeups eventually happen, preferably
sooner rather than later.
We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.
The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.
This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-10-04 14:33:34 -07:00
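
The deferral scheme described in the commit message above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct nocb_model` and every `model_*` function are hypothetical names invented here, the `irqs_disabled` argument stands in for the irqs_disabled_flags() check, and none of the kernel's locking or memory ordering is modelled.

```c
#include <stdbool.h>

/* Hypothetical model of one CPU's no-CBs state; not the kernel's rcu_data. */
struct nocb_model {
	bool defer_wakeup;	/* stands in for ->nocb_defer_wakeup */
	int wakeups;		/* how many times the kthread was "woken" */
};

/* Stand-in for waking the rcuo kthread (assumption: just counts calls). */
static void model_wake_kthread(struct nocb_model *m)
{
	m->wakeups++;
}

/*
 * Enqueue path: when interrupts are disabled (the proxy for "a scheduler
 * lock might be held"), record the wakeup instead of performing it.
 */
static void model_enqueue(struct nocb_model *m, bool irqs_disabled)
{
	if (irqs_disabled)
		m->defer_wakeup = true;
	else
		model_wake_kthread(m);
}

/*
 * Called from a context known to be safe (modelled on RCU core
 * processing or idle entry): perform any wakeup deferred earlier.
 * Returns true if a deferred wakeup was carried out.
 */
static bool model_do_deferred_wakeup(struct nocb_model *m)
{
	if (!m->defer_wakeup)
		return false;
	m->defer_wakeup = false;
	model_wake_kthread(m);
	return true;
}
```

In this model, an enqueue with interrupts disabled leaves the wakeup count unchanged until model_do_deferred_wakeup() runs, which mirrors the "defer more often than needed, but safely" trade-off the message describes.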
|
|
|
}
|
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
/* Initialize per-rcu_data variables for no-CBs CPUs. */
|
|
|
|
static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
|
|
|
|
{
|
rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.
This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.
One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.
This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcuog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.
This change does introduce additional kthreads, however:
1. The number of additional kthreads is about the square root of
the number of CPUs, so that a 4096-CPU system would have only
about 64 additional kthreads. Note that recent changes
decreased the number of rcuo kthreads by a factor of two
(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
scripting to affinity these additional kthreads as needed, the
same as for the rcuop and rcuos kthreads. (There are no longer
any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
this would allow the rcuo kthreads to continue their dual
leader/follower roles, it complicates callback invocation
and makes it more difficult to consolidate rcuo callback
invocation with existing softirq callback invocation.
The introduction of rcuog kthreads should thus be acceptable.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-29 16:43:51 -07:00
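
The "about the square root of the number of CPUs" estimate in the commit message above can be checked with a short calculation. The sketch below assumes, purely for illustration, that CPUs are split into groups of roughly sqrt(ncpus) members with one grace-period kthread per group; `model_isqrt()` and `model_gp_kthreads()` are hypothetical helpers, not kernel functions.

```c
/* Integer square root by simple search; adequate for CPU counts. */
static unsigned int model_isqrt(unsigned int n)
{
	unsigned int r = 0;

	while ((unsigned long long)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/*
 * Number of grace-period kthreads under the assumed grouping:
 * a group size (stride) of ncpus / isqrt(ncpus), one kthread per group.
 */
static unsigned int model_gp_kthreads(unsigned int ncpus)
{
	unsigned int stride;

	if (ncpus == 0)
		return 0;
	stride = ncpus / model_isqrt(ncpus);
	return (ncpus + stride - 1) / stride;	/* round up */
}
```

Under this assumption, model_gp_kthreads(4096) evaluates to 64, matching the figure quoted in the message.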
|
|
|
init_swait_queue_head(&rdp->nocb_cb_wq);
|
|
|
|
init_swait_queue_head(&rdp->nocb_gp_wq);
|
2020-11-13 13:13:19 +01:00
|
|
|
init_swait_queue_head(&rdp->nocb_state_wq);
|
2017-04-29 20:03:20 -07:00
|
|
|
raw_spin_lock_init(&rdp->nocb_lock);
|
rcu/nocb: Add bypass callback queueing
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.
This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.
The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.
[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-07-02 16:03:33 -07:00
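
The rate-limited bypass described in the commit message above can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel's algorithm: `struct bypass_model`, `model_flush()` and `model_call_rcu()` are hypothetical names, the lists are reduced to counters, the `lim` and `qhimark` parameters stand in for nocb_nobypass_lim_per_jiffy and qhimark, and the grace-period kthread's less-aggressive flushing and all locking are omitted.

```c
/* Hypothetical per-CPU model: callback lists reduced to lengths. */
struct bypass_model {
	unsigned long last_jiffy;	/* jiffy of the most recent enqueue */
	unsigned int direct_this_jiffy;	/* direct ->cblist enqueues this jiffy */
	unsigned long cblist_len;	/* models ->cblist */
	unsigned long bypass_len;	/* models ->nocb_bypass */
};

/* Move everything from the bypass list onto the main list. */
static void model_flush(struct bypass_model *b)
{
	b->cblist_len += b->bypass_len;
	b->bypass_len = 0;
}

/*
 * Enqueue one callback at time "now" (in jiffies).  Up to "lim"
 * callbacks per jiffy go straight to the main list; the rest go to the
 * bypass list, which is flushed on a new jiffy or when it grows past
 * "qhimark".
 */
static void model_call_rcu(struct bypass_model *b, unsigned long now,
			   unsigned int lim, unsigned long qhimark)
{
	if (now != b->last_jiffy) {
		b->last_jiffy = now;
		b->direct_this_jiffy = 0;
		model_flush(b);
	}
	if (b->direct_this_jiffy < lim) {
		b->direct_this_jiffy++;
		b->cblist_len++;
		return;
	}
	b->bypass_len++;
	if (b->bypass_len > qhimark)
		model_flush(b);
}
```

The point of the design, as the message explains, is that most enqueues during a flood touch only the bypass list, so the contended ->cblist lock is needed far less often.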
|
|
|
raw_spin_lock_init(&rdp->nocb_bypass_lock);
|
2019-06-02 13:41:08 -07:00
|
|
|
raw_spin_lock_init(&rdp->nocb_gp_lock);
|
2017-10-22 17:58:54 -07:00
|
|
|
timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
|
2019-07-02 16:03:33 -07:00
|
|
|
timer_setup(&rdp->nocb_bypass_timer, do_nocb_bypass_wakeup_timer, 0);
|
|
|
|
rcu_cblist_init(&rdp->nocb_bypass);
|
2012-08-19 21:35:53 -07:00
|
|
|
}
|
|
|
|
|
2014-07-11 11:30:24 -07:00
|
|
|
/*
|
|
|
|
* If the specified CPU is a no-CBs CPU that does not already have its
|
2019-03-29 16:43:51 -07:00
|
|
|
* rcuo CB kthread, spawn it. Additionally, if the rcuo GP kthread
|
|
|
|
* for this CPU's group has not yet been created, spawn it as well.
|
2014-07-11 11:30:24 -07:00
|
|
|
*/
|
2018-07-03 17:22:34 -07:00
|
|
|
static void rcu_spawn_one_nocb_kthread(int cpu)
|
2014-07-11 11:30:24 -07:00
|
|
|
{
|
2019-03-29 16:43:51 -07:00
|
|
|
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
|
|
|
|
struct rcu_data *rdp_gp;
|
2014-07-11 11:30:24 -07:00
|
|
|
struct task_struct *t;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this isn't a no-CBs CPU or if it already has an rcuo kthread,
|
|
|
|
* then nothing to do.
|
|
|
|
*/
|
2019-03-29 16:43:51 -07:00
|
|
|
if (!rcu_is_nocb_cpu(cpu) || rdp->nocb_cb_kthread)
|
2014-07-11 11:30:24 -07:00
|
|
|
return;
|
|
|
|
|
2019-03-28 15:44:18 -07:00
|
|
|
/* If we didn't spawn the GP kthread first, reorganize! */
|
2019-03-29 16:43:51 -07:00
|
|
|
rdp_gp = rdp->nocb_gp_rdp;
|
|
|
|
if (!rdp_gp->nocb_gp_kthread) {
|
|
|
|
t = kthread_run(rcu_nocb_gp_kthread, rdp_gp,
|
|
|
|
"rcuog/%d", rdp_gp->cpu);
|
|
|
|
if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo GP kthread, OOM is now expected behavior\n", __func__))
|
|
|
|
return;
|
|
|
|
WRITE_ONCE(rdp_gp->nocb_gp_kthread, t);
|
2014-07-11 11:30:24 -07:00
|
|
|
}
|
|
|
|
|
2018-07-07 18:12:26 -07:00
|
|
|
/* Spawn the kthread for this CPU. */
|
rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups. The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks. The non-leader rcuo kthreads
only invoke callbacks.
This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks. However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up. This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.
One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.
This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks. Instead of having one of these kthreads act as
leader, each group has a separate rcuog kthread that handles grace periods
for its group. Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.
This change does introduce additional kthreads, however:
1. The number of additional kthreads is about the square root of
the number of CPUs, so that a 4096-CPU system would have only
about 64 additional kthreads. Note that recent changes
decreased the number of rcuo kthreads by a factor of two
(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
this still represents a significant improvement on most systems.
2. The leading "rcuo" of the rcuog kthreads should allow existing
scripting to affinity these additional kthreads as needed, the
same as for the rcuop and rcuos kthreads. (There are no longer
any rcuob kthreads.)
3. A state-machine approach was considered and rejected. Although
this would allow the rcuo kthreads to continue their dual
leader/follower roles, it complicates callback invocation
and makes it more difficult to consolidate rcuo callback
invocation with existing softirq callback invocation.
The introduction of rcuog kthreads should thus be acceptable.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-03-29 16:43:51 -07:00
|
|
|
t = kthread_run(rcu_nocb_cb_kthread, rdp,
|
2018-07-03 17:22:34 -07:00
|
|
|
"rcuo%c/%d", rcu_state.abbr, cpu);
|
2019-03-29 16:43:51 -07:00
|
|
|
if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo CB kthread, OOM is now expected behavior\n", __func__))
|
2018-10-22 08:26:00 -07:00
|
|
|
return;
|
2019-03-29 16:43:51 -07:00
|
|
|
WRITE_ONCE(rdp->nocb_cb_kthread, t);
|
|
|
|
WRITE_ONCE(rdp->nocb_gp_kthread, rdp_gp->nocb_gp_kthread);
|
2014-07-11 11:30:24 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the specified CPU is a no-CBs CPU that does not already have its
|
2018-11-27 13:55:53 -08:00
|
|
|
* rcuo kthread, spawn it.
|
2014-07-11 11:30:24 -07:00
|
|
|
*/
|
2018-11-27 13:55:53 -08:00
|
|
|
static void rcu_spawn_cpu_nocb_kthread(int cpu)
|
2014-07-11 11:30:24 -07:00
|
|
|
{
|
|
|
|
if (rcu_scheduler_fully_active)
|
2018-07-04 15:35:00 -07:00
|
|
|
rcu_spawn_one_nocb_kthread(cpu);
|
2014-07-11 11:30:24 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Once the scheduler is running, spawn rcuo kthreads for all online
|
|
|
|
* no-CBs CPUs. This assumes that the early_initcall()s happen before
|
|
|
|
* non-boot CPUs come online -- if this changes, we will need to add
|
|
|
|
* some mutual exclusion.
|
|
|
|
*/
|
|
|
|
static void __init rcu_spawn_nocb_kthreads(void)
|
|
|
|
{
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_online_cpu(cpu)
|
2018-11-27 13:55:53 -08:00
|
|
|
rcu_spawn_cpu_nocb_kthread(cpu);
|
2014-07-11 11:30:24 -07:00
|
|
|
}
|
|
|
|
|
2019-03-28 15:44:18 -07:00
|
|
|
/* How many CB CPU IDs per GP kthread? Default of -1 for sqrt(nr_cpu_ids). */
|
2019-04-02 08:05:55 -07:00
|
|
|
static int rcu_nocb_gp_stride = -1;
|
|
|
|
module_param(rcu_nocb_gp_stride, int, 0444);
|
rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things. This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.
To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers. By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders. In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.
For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period. This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-24 09:26:11 -07:00
|
|
|
|
|
|
|
/*
|
2019-03-28 15:44:18 -07:00
|
|
|
* Initialize GP-CB relationships for all no-CBs CPUs.
|
2014-06-24 09:26:11 -07:00
|
|
|
*/
|
2018-07-03 17:22:34 -07:00
|
|
|
static void __init rcu_organize_nocb_kthreads(void)
|
2012-08-19 21:35:53 -07:00
|
|
|
{
|
|
|
|
int cpu;
|
2019-06-01 05:12:36 -07:00
|
|
|
bool firsttime = true;
|
2019-10-04 19:49:10 +00:00
|
|
|
bool gotnocbs = false;
|
|
|
|
bool gotnocbscbs = true;
|
2019-04-02 08:05:55 -07:00
|
|
|
int ls = rcu_nocb_gp_stride;
|
2019-03-28 15:44:18 -07:00
|
|
|
int nl = 0; /* Next GP kthread. */
|
2012-08-19 21:35:53 -07:00
|
|
|
struct rcu_data *rdp;
|
2019-03-31 16:20:52 -07:00
|
|
|
struct rcu_data *rdp_gp = NULL; /* Suppress misguided gcc warn. */
|
2014-06-24 09:26:11 -07:00
|
|
|
struct rcu_data *rdp_prev = NULL;
|
2012-08-19 21:35:53 -07:00
|
|
|
|
2017-11-17 21:40:15 +06:00
|
|
|
if (!cpumask_available(rcu_nocb_mask))
|
2012-08-19 21:35:53 -07:00
|
|
|
return;
|
2014-06-24 09:26:11 -07:00
|
|
|
if (ls == -1) {
|
2019-06-01 05:14:47 -07:00
|
|
|
ls = nr_cpu_ids / int_sqrt(nr_cpu_ids);
|
2019-04-02 08:05:55 -07:00
|
|
|
rcu_nocb_gp_stride = ls;
|
2014-06-24 09:26:11 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2017-01-02 14:24:24 -08:00
|
|
|
* Each pass through this loop sets up one rcu_data structure.
|
|
|
|
* Should the corresponding CPU come online in the future, then
|
|
|
|
* we will spawn the needed rcu_nocb_gp_kthread() and rcu_nocb_cb_kthread() kthreads.
|
2014-06-24 09:26:11 -07:00
|
|
|
*/
|
2012-08-19 21:35:53 -07:00
|
|
|
for_each_cpu(cpu, rcu_nocb_mask) {
|
2018-07-03 15:37:16 -07:00
|
|
|
rdp = per_cpu_ptr(&rcu_data, cpu);
|
2014-06-24 09:26:11 -07:00
|
|
|
if (rdp->cpu >= nl) {
|
2019-03-28 15:44:18 -07:00
|
|
|
/* New GP kthread, set up for CBs & next GP. */
|
2019-10-04 19:49:10 +00:00
|
|
|
gotnocbs = true;
|
2014-06-24 09:26:11 -07:00
|
|
|
nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls;
|
2019-03-28 15:33:59 -07:00
|
|
|
rdp->nocb_gp_rdp = rdp;
|
2019-03-31 16:20:52 -07:00
|
|
|
rdp_gp = rdp;
|
2019-10-04 19:49:10 +00:00
|
|
|
if (dump_tree) {
|
|
|
|
if (!firsttime)
|
|
|
|
pr_cont("%s\n", gotnocbscbs
|
|
|
|
? "" : " (self only)");
|
|
|
|
gotnocbscbs = false;
|
|
|
|
firsttime = false;
|
|
|
|
pr_alert("%s: No-CB GP kthread CPU %d:",
|
|
|
|
__func__, cpu);
|
|
|
|
}
|
2014-06-24 09:26:11 -07:00
|
|
|
} else {
|
2019-03-28 15:44:18 -07:00
|
|
|
/* Another CB kthread, link to previous GP kthread. */
|
2019-10-04 19:49:10 +00:00
|
|
|
gotnocbscbs = true;
|
2019-03-31 16:20:52 -07:00
|
|
|
rdp->nocb_gp_rdp = rdp_gp;
|
2019-03-28 15:33:59 -07:00
|
|
|
rdp_prev->nocb_next_cb_rdp = rdp;
|
2019-10-04 19:49:10 +00:00
|
|
|
if (dump_tree)
|
|
|
|
pr_cont(" %d", cpu);
|
2014-06-24 09:26:11 -07:00
|
|
|
}
|
|
|
|
rdp_prev = rdp;
|
2012-08-19 21:35:53 -07:00
|
|
|
}
|
2019-10-04 19:49:10 +00:00
|
|
|
if (gotnocbs && dump_tree)
|
|
|
|
pr_cont("%s\n", gotnocbscbs ? "" : " (self only)");
|
2012-08-19 21:35:53 -07:00
|
|
|
}
|
|
|
|
|
2018-09-21 18:08:09 -07:00
|
|
|
/*
|
|
|
|
* Bind the current task to the offloaded CPUs. If there are no offloaded
|
|
|
|
* CPUs, leave the task unbound. Splat if the bind attempt fails.
|
|
|
|
*/
|
|
|
|
void rcu_bind_current_to_nocb(void)
|
|
|
|
{
|
|
|
|
if (cpumask_available(rcu_nocb_mask) && cpumask_weight(rcu_nocb_mask))
|
|
|
|
WARN_ON(sched_setaffinity(current->pid, rcu_nocb_mask));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb);
|
|
|
|
|
2020-12-18 10:20:34 -08:00
|
|
|
// The ->on_cpu field is available only in CONFIG_SMP=y, so...
|
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
static char *show_rcu_should_be_on_cpu(struct task_struct *tsp)
|
|
|
|
{
|
|
|
|
return tsp && tsp->state == TASK_RUNNING && !tsp->on_cpu ? "!" : "";
|
|
|
|
}
|
|
|
|
#else // #ifdef CONFIG_SMP
|
|
|
|
static char *show_rcu_should_be_on_cpu(struct task_struct *tsp)
|
|
|
|
{
|
|
|
|
return "";
|
|
|
|
}
|
|
|
|
#endif // #else #ifdef CONFIG_SMP
|
|
|
|
|
2019-06-25 13:32:51 -07:00
|
|
|
/*
|
|
|
|
* Dump out nocb grace-period kthread state for the specified rcu_data
|
|
|
|
* structure.
|
|
|
|
*/
|
|
|
|
static void show_rcu_nocb_gp_state(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
struct rcu_node *rnp = rdp->mynode;
|
|
|
|
|
2020-12-18 10:20:34 -08:00
|
|
|
pr_info("nocb GP %d %c%c%c%c%c%c %c[%c%c] %c%c:%ld rnp %d:%d %lu %c CPU %d%s\n",
|
2019-06-25 13:32:51 -07:00
|
|
|
rdp->cpu,
|
|
|
|
"kK"[!!rdp->nocb_gp_kthread],
|
|
|
|
"lL"[raw_spin_is_locked(&rdp->nocb_gp_lock)],
|
|
|
|
"dD"[!!rdp->nocb_defer_wakeup],
|
|
|
|
"tT"[timer_pending(&rdp->nocb_timer)],
|
|
|
|
"bB"[timer_pending(&rdp->nocb_bypass_timer)],
|
|
|
|
"sS"[!!rdp->nocb_gp_sleep],
|
|
|
|
".W"[swait_active(&rdp->nocb_gp_wq)],
|
|
|
|
".W"[swait_active(&rnp->nocb_gp_wq[0])],
|
|
|
|
".W"[swait_active(&rnp->nocb_gp_wq[1])],
|
|
|
|
".B"[!!rdp->nocb_gp_bypass],
|
|
|
|
".G"[!!rdp->nocb_gp_gp],
|
|
|
|
(long)rdp->nocb_gp_seq,
|
2020-12-18 10:20:34 -08:00
|
|
|
rnp->grplo, rnp->grphi, READ_ONCE(rdp->nocb_gp_loops),
|
|
|
|
rdp->nocb_gp_kthread ? task_state_to_char(rdp->nocb_gp_kthread) : '.',
|
|
|
|
rdp->nocb_gp_kthread ? (int)task_cpu(rdp->nocb_gp_kthread) : -1,
|
|
|
|
show_rcu_should_be_on_cpu(rdp->nocb_gp_kthread));
|
2019-06-25 13:32:51 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Dump out nocb kthread state for the specified rcu_data structure. */
|
|
|
|
static void show_rcu_nocb_state(struct rcu_data *rdp)
|
|
|
|
{
|
2020-12-18 10:20:34 -08:00
|
|
|
char bufw[20];
|
|
|
|
char bufr[20];
|
2019-06-25 13:32:51 -07:00
|
|
|
struct rcu_segcblist *rsclp = &rdp->cblist;
|
|
|
|
bool waslocked;
|
|
|
|
bool wastimer;
|
|
|
|
bool wassleep;
|
|
|
|
|
|
|
|
if (rdp->nocb_gp_rdp == rdp)
|
|
|
|
show_rcu_nocb_gp_state(rdp);
|
|
|
|
|
2020-12-18 10:20:34 -08:00
|
|
|
sprintf(bufw, "%ld", rsclp->gp_seq[RCU_WAIT_TAIL]);
|
|
|
|
sprintf(bufr, "%ld", rsclp->gp_seq[RCU_NEXT_READY_TAIL]);
|
|
|
|
pr_info(" CB %d^%d->%d %c%c%c%c%c%c F%ld L%ld C%d %c%c%s%c%s%c%c q%ld %c CPU %d%s\n",
|
2019-06-25 13:32:51 -07:00
|
|
|
rdp->cpu, rdp->nocb_gp_rdp->cpu,
|
2020-12-18 13:17:37 -08:00
|
|
|
rdp->nocb_next_cb_rdp ? rdp->nocb_next_cb_rdp->cpu : -1,
|
2019-06-25 13:32:51 -07:00
|
|
|
"kK"[!!rdp->nocb_cb_kthread],
|
|
|
|
"bB"[raw_spin_is_locked(&rdp->nocb_bypass_lock)],
|
|
|
|
"cC"[!!atomic_read(&rdp->nocb_lock_contended)],
|
|
|
|
"lL"[raw_spin_is_locked(&rdp->nocb_lock)],
|
|
|
|
"sS"[!!rdp->nocb_cb_sleep],
|
|
|
|
".W"[swait_active(&rdp->nocb_cb_wq)],
|
|
|
|
jiffies - rdp->nocb_bypass_first,
|
|
|
|
jiffies - rdp->nocb_nobypass_last,
|
|
|
|
rdp->nocb_nobypass_count,
|
|
|
|
".D"[rcu_segcblist_ready_cbs(rsclp)],
|
2020-12-18 10:20:34 -08:00
|
|
|
".W"[!rcu_segcblist_segempty(rsclp, RCU_WAIT_TAIL)],
|
|
|
|
rcu_segcblist_segempty(rsclp, RCU_WAIT_TAIL) ? "" : bufw,
|
|
|
|
".R"[!rcu_segcblist_segempty(rsclp, RCU_NEXT_READY_TAIL)],
|
|
|
|
rcu_segcblist_segempty(rsclp, RCU_NEXT_READY_TAIL) ? "" : bufr,
|
|
|
|
".N"[!rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL)],
|
2019-06-25 13:32:51 -07:00
|
|
|
".B"[!!rcu_cblist_n_cbs(&rdp->nocb_bypass)],
|
2020-12-18 10:20:34 -08:00
|
|
|
rcu_segcblist_n_cbs(&rdp->cblist),
|
|
|
|
rdp->nocb_cb_kthread ? task_state_to_char(rdp->nocb_cb_kthread) : '.',
|
|
|
|
rdp->nocb_cb_kthread ? (int)task_cpu(rdp->nocb_cb_kthread) : -1,
|
|
|
|
show_rcu_should_be_on_cpu(rdp->nocb_cb_kthread));
|
2019-06-25 13:32:51 -07:00
|
|
|
|
|
|
|
/* It is OK for GP kthreads to have GP state. */
|
|
|
|
if (rdp->nocb_gp_rdp == rdp)
|
|
|
|
return;
|
|
|
|
|
|
|
|
waslocked = raw_spin_is_locked(&rdp->nocb_gp_lock);
|
2020-06-22 16:46:43 -07:00
|
|
|
wastimer = timer_pending(&rdp->nocb_bypass_timer);
|
2019-06-25 13:32:51 -07:00
|
|
|
wassleep = swait_active(&rdp->nocb_gp_wq);
|
2020-06-22 16:46:43 -07:00
|
|
|
if (!rdp->nocb_gp_sleep && !waslocked && !wastimer && !wassleep)
|
2019-06-25 13:32:51 -07:00
|
|
|
return; /* Nothing untoward. */
|
|
|
|
|
2020-06-22 09:25:34 -07:00
|
|
|
pr_info(" nocb GP activity on CB-only CPU!!! %c%c%c%c %c\n",
|
2019-06-25 13:32:51 -07:00
|
|
|
"lL"[waslocked],
|
|
|
|
"dD"[!!rdp->nocb_defer_wakeup],
|
|
|
|
"tT"[wastimer],
|
|
|
|
"sS"[!!rdp->nocb_gp_sleep],
|
|
|
|
".W"[wassleep]);
|
|
|
|
}
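The diagnostics above build their status letters by indexing a two-character string literal with a 0-or-1 value, which is why non-boolean fields are first normalized with `!!`. The following user-space sketch shows the idiom in isolation. The helper names `flag_char()` and `format_gp_flags()` are hypothetical and are not part of the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Pick one of two status characters: pair[0] when the condition is
 * false, pair[1] when it is true.  The index must be exactly 0 or 1,
 * which a bool parameter guarantees.
 */
static char flag_char(const char pair[2], bool set)
{
	return pair[set];
}

/*
 * Hypothetical helper: format four flags in the same order as the
 * "nocb GP activity on CB-only CPU" message.  Returns the number of
 * characters written, not counting the terminating NUL.
 */
static int format_gp_flags(char *buf, size_t len, bool locked, bool deferred,
			   bool timer, bool sleeping)
{
	assert(buf && len >= 5);
	return snprintf(buf, len, "%c%c%c%c",
			"lL"[locked], "dD"[deferred],
			"tT"[timer], ".W"[sleeping]);
}
```

An uppercase letter means the condition holds, and a lowercase letter or a dot means it does not, so a line of output can be scanned quickly for anything unexpected.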
|
|
|
|
|
2013-01-07 13:37:42 -08:00
|
|
|
#else /* #ifdef CONFIG_RCU_NOCB_CPU */
|
|
|
|
|
2019-05-15 09:56:40 -07:00
|
|
|
/* No ->nocb_lock to acquire. */
|
|
|
|
static void rcu_nocb_lock(struct rcu_data *rdp)
|
rcu: Make rcu_barrier() understand about missing rcuo kthreads
Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online. This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a44a, this could result in huge numbers of useless
rcuo kthreads being created.
It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix. The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.
It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread. This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks. It is therefore required to wait even for those callbacks
that cannot possibly be invoked, even if doing so hangs the system.
Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), it is tempting to report an error
in this case. Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().
So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.
Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>
2014-10-27 09:15:54 -07:00
|
|
|
{
|
2019-05-15 09:56:40 -07:00
|
|
|
}
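The rcu_barrier() commit message above describes a three-way decision for each no-CBs CPU. A minimal user-space sketch of that decision follows, assuming a simplified per-CPU summary. The names `struct cpu_model`, `enum barrier_action`, and `barrier_decide()` are hypothetical and do not appear in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU summary; not the kernel's struct rcu_data. */
struct cpu_model {
	bool has_rcuo_kthread;	/* Has an rcuo kthread been spawned? */
	long n_cbs;		/* Callbacks currently queued. */
};

enum barrier_action {
	BARRIER_SKIP,		/* No kthread and nothing queued. */
	BARRIER_POST,		/* Normal case: post a barrier callback. */
	BARRIER_POST_AND_WARN,	/* Callbacks but no kthread: may hang. */
};

/* Decide what a barrier operation does for one no-CBs CPU. */
static enum barrier_action barrier_decide(const struct cpu_model *cpu)
{
	assert(cpu && cpu->n_cbs >= 0);
	if (cpu->has_rcuo_kthread)
		return BARRIER_POST;
	if (cpu->n_cbs == 0)
		return BARRIER_SKIP;
	/* Must still wait for these callbacks, so complain and post. */
	return BARRIER_POST_AND_WARN;
}
```

The last case is the one the commit message calls unavoidable: the callbacks are pending, so the barrier has to wait for them even though nothing is available to invoke them.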
|
|
|
|
|
|
|
|
/* No ->nocb_lock to release. */
|
|
|
|
static void rcu_nocb_unlock(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
/* No ->nocb_lock to release, but interrupts must still be restored. */
|
|
|
|
static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
|
|
|
|
unsigned long flags)
|
|
|
|
{
|
|
|
|
local_irq_restore(flags);
|
2014-10-27 09:15:54 -07:00
|
|
|
}
|
|
|
|
|
rcu/nocb: Add bypass callback queueing
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations. However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking. Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.
This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure. Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.
The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first. During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations. However,
a CPU could simply stop invoking call_rcu() at any time. The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first). This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is
used to provide the needed wakeups.
[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-07-02 16:03:33 -07:00
|
|
|
/* Lockdep check that ->cblist may be safely accessed. */
|
|
|
|
static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
lockdep_assert_irqs_disabled();
|
|
|
|
}
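The bypass-queueing commit message above describes a per-jiffy limit on direct enqueues, with overflow going to a bypass list that is flushed when it gets old or long. The following user-space sketch models only the counting, under stated assumptions: the names `struct bypass_model`, `bypass_flush()`, and `enqueue_cb()` are hypothetical, the lists are reduced to lengths, and locking and the timer are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one no-CBs CPU's queues; names are illustrative. */
struct bypass_model {
	unsigned long last_jiffy;	/* Jiffy of the latest enqueue. */
	int nobypass_count;		/* Direct enqueues during last_jiffy. */
	long cblist_len;		/* Callbacks on the main list. */
	long bypass_len;		/* Callbacks parked on the bypass list. */
	unsigned long bypass_first;	/* Jiffy when bypass became non-empty. */
};

/* Move everything from the bypass list to the main list. */
static void bypass_flush(struct bypass_model *m)
{
	m->cblist_len += m->bypass_len;
	m->bypass_len = 0;
}

/*
 * Enqueue one callback at time "now".  Returns true if it went onto the
 * bypass list, false if it went directly onto the main list.
 */
static bool enqueue_cb(struct bypass_model *m, unsigned long now,
		       int lim_per_jiffy, long qhimark)
{
	assert(m && lim_per_jiffy > 0 && qhimark > 0);

	/* A new jiffy resets the direct-enqueue budget. */
	if (now != m->last_jiffy) {
		m->last_jiffy = now;
		m->nobypass_count = 0;
	}

	/* Flush a bypass list that is too old or too long. */
	if (m->bypass_len &&
	    (now != m->bypass_first || m->bypass_len >= qhimark))
		bypass_flush(m);

	/* Under the limit: use the main list directly. */
	if (m->nobypass_count < lim_per_jiffy) {
		m->nobypass_count++;
		m->cblist_len++;
		return false;
	}

	/* Over the limit: park the callback on the bypass list. */
	if (!m->bypass_len)
		m->bypass_first = now;
	m->bypass_len++;
	return true;
}
```

In this model the enqueuing CPU does all the flushing. The commit message also has the grace-period kthread flush less aggressively, driven by a timer, to cover a CPU that stops calling call_rcu() while callbacks remain on the bypass list.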
|
|
|
|
|
2016-02-19 09:46:41 +01:00
|
|
|
static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq)
|
2012-08-19 21:35:53 -07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2016-02-19 09:46:41 +01:00
|
|
|
static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp)
|
2016-02-19 09:46:40 +01:00
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2013-02-10 20:48:58 -08:00
|
|
|
static void rcu_init_one_nocb(struct rcu_node *rnp)
|
|
|
|
{
|
|
|
|
}
|
2012-08-19 21:35:53 -07:00
|
|
|
|
2019-07-02 16:03:33 -07:00
|
|
|
static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
|
|
|
|
unsigned long j)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
|
|
|
|
bool *was_alldone, unsigned long flags)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2019-05-15 09:56:40 -07:00
|
|
|
static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
|
|
|
|
unsigned long flags)
|
2012-08-19 21:35:53 -07:00
|
|
|
{
|
2019-05-15 09:56:40 -07:00
|
|
|
WARN_ON_ONCE(1); /* Should be dead code! */
|
2012-08-19 21:35:53 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2014-07-29 14:50:47 -07:00
|
|
|
static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
|
rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&ctx->lock);
> lock(&rq->lock);
> lock(&ctx->lock);
> lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.
One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:
1. Determine when it is likely that a relevant scheduler lock is held.
2. Defer the wakeup in such cases.
3. Ensure that all deferred wakeups eventually happen, preferably
sooner rather than later.
We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.
The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.
This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-10-04 14:33:34 -07:00
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
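The deadlock commit message above lists three steps: detect when a scheduler lock is likely held, defer the wakeup in that case, and make sure deferred wakeups eventually happen. A minimal user-space sketch of that flow follows. The names `struct wake_model`, `wake_kthread()`, `wake_or_defer()`, and `do_deferred_wakeup()` are hypothetical stand-ins, and the boolean argument models the irqs_disabled_flags() proxy test.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state; not struct rcu_data. */
struct wake_model {
	bool defer_wakeup;	/* Models ->nocb_defer_wakeup. */
	int wakeups;		/* Number of wakeups actually performed. */
};

/* Stand-in for waking the rcuo kthread, which takes scheduler locks. */
static void wake_kthread(struct wake_model *w)
{
	w->wakeups++;
}

/*
 * Called after enqueueing a callback.  irqs_were_disabled is the proxy
 * for "a relevant scheduler lock may be held by the caller".
 */
static void wake_or_defer(struct wake_model *w, bool irqs_were_disabled)
{
	assert(w);
	if (irqs_were_disabled)
		w->defer_wakeup = true;	/* Unsafe now: remember it. */
	else
		wake_kthread(w);
}

/*
 * Called from a point where no scheduler lock is held, such as RCU core
 * processing or entry to idle.  Returns true if a deferred wakeup was
 * carried out.
 */
static bool do_deferred_wakeup(struct wake_model *w)
{
	assert(w);
	if (!w->defer_wakeup)
		return false;
	w->defer_wakeup = false;
	wake_kthread(w);
	return true;
}
```

The proxy can defer more often than strictly needed, since interrupts may be disabled for other reasons, but as the commit message notes, that errs on the safe side.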
|
|
|
|
|
2021-02-01 00:05:46 +01:00
|
|
|
static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
|
2013-10-04 14:33:34 -07:00
|
|
|
{
|
2021-02-01 00:05:46 +01:00
|
|
|
return false;
|
rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.12.0-rc3+ #92 Not tainted
> -------------------------------------------------------
> trinity-child2/15191 is trying to acquire lock:
> (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
> (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
> [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
> [<ffffffff81732052>] __schedule+0x1d2/0xa20
> [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
> [<ffffffff817352b6>] retint_kernel+0x26/0x30
> [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
> [<ffffffff813f0504>] pty_write+0x54/0x60
> [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
> [<ffffffff813e5838>] tty_write+0x158/0x2d0
> [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
> [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
> [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
> [<ffffffff81054336>] do_fork+0x126/0x460
> [<ffffffff81054696>] kernel_thread+0x26/0x30
> [<ffffffff8171ff93>] rest_init+0x23/0x140
> [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
> [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
> [<ffffffff81097d62>] default_wake_function+0x12/0x20
> [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
> [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
> [<ffffffff8108ff59>] __wake_up+0x39/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111b8d>] call_rcu+0x1d/0x20
> [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
> [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
> [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
> [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
> [<ffffffff817200be>] kernel_init+0xe/0x190
> [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
> &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&ctx->lock);
> lock(&rq->lock);
> lock(&ctx->lock);
> lock(&rdp->nocb_wq);
>
> *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
> #0: (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
> ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
> ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
> ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
> [<ffffffff8172a363>] dump_stack+0x4e/0x82
> [<ffffffff81726741>] print_circular_bug+0x200/0x20f
> [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
> [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
> [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
> [<ffffffff810cc243>] lock_acquire+0x93/0x200
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
> [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
> [<ffffffff8108ff43>] __wake_up+0x23/0x50
> [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
> [<ffffffff81111450>] __call_rcu+0x140/0x820
> [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
> [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
> [<ffffffff81149abf>] put_ctx+0x4f/0x70
> [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
> [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
> [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
> [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
> [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks. The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.
One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held. Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:
1. Determine when it is likely that a relevant scheduler lock is held.
2. Defer the wakeup in such cases.
3. Ensure that all deferred wakeups eventually happen, preferably
sooner rather than later.
We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held. This works because the relevant locks are always acquired
with interrupts disabled. We may defer more often than needed, but that
is at least safe.
The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
This flag is checked by the RCU core processing. The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set. Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!). So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.
This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".
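The deferral scheme described above can be sketched as a small user-space model. This is not the kernel implementation: `struct model_rdp`, `model_enqueue()`, `model_wake_kthread()`, and `model_do_deferred_wakeup()` are hypothetical names, and the `irqs_off` argument merely stands in for the result of irqs_disabled_flags(). Only the ->nocb_defer_wakeup flag name comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the rcu_data fields relevant to this sketch. */
struct model_rdp {
	bool nocb_defer_wakeup;	/* A wakeup was deferred and is still owed. */
	int wakeups_done;	/* Number of modeled rcuo kthread wakeups. */
};

/* Models waking the rcuo kthread; the real wakeup enters the scheduler. */
static void model_wake_kthread(struct model_rdp *rdp)
{
	rdp->wakeups_done++;
}

/*
 * Models the enqueue path.  With interrupts disabled a scheduler lock
 * might be held, so only record that a wakeup is owed.  Otherwise wake
 * the kthread immediately.
 */
static void model_enqueue(struct model_rdp *rdp, bool irqs_off)
{
	assert(rdp != NULL);
	if (irqs_off)
		rdp->nocb_defer_wakeup = true;
	else
		model_wake_kthread(rdp);
}

/*
 * Models the later check points (RCU core processing, idle entry,
 * nohz_full user-mode entry): perform any wakeup that is still owed.
 * Returns true if a deferred wakeup was carried out.
 */
static bool model_do_deferred_wakeup(struct model_rdp *rdp)
{
	assert(rdp != NULL);
	if (!rdp->nocb_defer_wakeup)
		return false;
	rdp->nocb_defer_wakeup = false;
	model_wake_kthread(rdp);
	return true;
}
```

In this model, an enqueue with interrupts disabled leaves the flag set and performs no wakeup. The next call to model_do_deferred_wakeup() clears the flag and performs exactly one wakeup, which corresponds to step 3 in the list above.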
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-10-04 14:33:34 -07:00
|
|
|
}
|
|
|
|
|
2018-11-27 13:55:53 -08:00
|
|
|
static void rcu_spawn_cpu_nocb_kthread(int cpu)
|
2014-07-11 11:30:24 -07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __init rcu_spawn_nocb_kthreads(void)
|
2012-08-19 21:35:53 -07:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2019-06-25 13:32:51 -07:00
|
|
|
static void show_rcu_nocb_state(struct rcu_data *rdp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2012-08-19 21:35:53 -07:00
|
|
|
#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
|
2013-04-12 16:19:10 -07:00
|
|
|
|
2013-11-08 09:03:10 -08:00
|
|
|
/*
|
|
|
|
* Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
|
|
|
|
* grace-period kthread will do force_quiescent_state() processing?
|
|
|
|
* The idea is to avoid waking up RCU core processing on such a
|
|
|
|
* CPU unless the grace period has extended for too long.
|
|
|
|
*
|
|
|
|
* This code relies on the fact that all NO_HZ_FULL CPUs are also
|
2014-02-09 14:35:11 +01:00
|
|
|
* CONFIG_RCU_NOCB_CPU CPUs.
|
2013-11-08 09:03:10 -08:00
|
|
|
*/
|
2018-07-03 17:22:34 -07:00
|
|
|
static bool rcu_nohz_full_cpu(void)
|
2013-11-08 09:03:10 -08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_NO_HZ_FULL
|
|
|
|
if (tick_nohz_full_cpu(smp_processor_id()) &&
|
2018-07-03 17:22:34 -07:00
|
|
|
(!rcu_gp_in_progress() ||
|
2020-04-10 17:05:22 -07:00
|
|
|
time_before(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)))
|
2015-03-30 16:46:16 -07:00
|
|
|
return true;
|
2013-11-08 09:03:10 -08:00
|
|
|
#endif /* #ifdef CONFIG_NO_HZ_FULL */
|
2015-03-30 16:46:16 -07:00
|
|
|
return false;
|
2013-11-08 09:03:10 -08:00
|
|
|
}
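The time_before(jiffies, gp_start + HZ) test in rcu_nohz_full_cpu() depends on a comparison that stays correct when the tick counter wraps. The following user-space sketch illustrates that idea with signed subtraction. The names `model_time_before()` and `model_gp_is_young()` are hypothetical, and the cast from unsigned long to long assumes the usual two's-complement conversion that GCC and Clang provide.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Wraparound-safe "a is before b" for free-running tick counters.
 * The unsigned subtraction wraps modulo 2^N, and reinterpreting the
 * difference as signed gives the right answer as long as the two
 * times are less than LONG_MAX ticks apart.
 */
static bool model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Hypothetical helper: has the grace period that began at gp_start
 * been running for less than hz ticks as of time now?
 */
static bool model_gp_is_young(unsigned long now, unsigned long gp_start,
			      unsigned long hz)
{
	assert(hz <= (unsigned long)LONG_MAX);
	return model_time_before(now, gp_start + hz);
}
```

A plain `now < gp_start + hz` comparison would give the wrong answer whenever gp_start + hz wraps past ULONG_MAX, which is why the signed-difference form is used.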
|
2014-04-01 11:20:36 -07:00
|
|
|
|
|
|
|
/*
|
2018-03-19 11:53:22 -07:00
|
|
|
* Bind the RCU grace-period kthreads to the housekeeping CPU.
|
2014-04-01 11:20:36 -07:00
|
|
|
*/
|
|
|
|
static void rcu_bind_gp_kthread(void)
|
|
|
|
{
|
2014-06-04 13:46:03 -07:00
|
|
|
if (!tick_nohz_full_enabled())
|
2014-04-01 11:20:36 -07:00
|
|
|
return;
|
2017-10-27 04:42:35 +02:00
|
|
|
housekeeping_affine(current, HK_FLAG_RCU);
|
2014-04-01 11:20:36 -07:00
|
|
|
}
|
2014-08-04 17:43:50 -07:00
|
|
|
|
|
|
|
/* Record the current task on dyntick-idle entry. */
|
2020-03-13 17:32:17 +01:00
|
|
|
static void noinstr rcu_dynticks_task_enter(void)
|
2014-08-04 17:43:50 -07:00
|
|
|
{
|
|
|
|
#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
|
2015-03-03 14:57:58 -08:00
|
|
|
WRITE_ONCE(current->rcu_tasks_idle_cpu, smp_processor_id());
|
2014-08-04 17:43:50 -07:00
|
|
|
#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Record no current task on dyntick-idle exit. */
|
2020-03-13 17:32:17 +01:00
|
|
|
static void noinstr rcu_dynticks_task_exit(void)
|
2014-08-04 17:43:50 -07:00
|
|
|
{
|
|
|
|
#if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL)
|
2015-03-03 14:57:58 -08:00
|
|
|
WRITE_ONCE(current->rcu_tasks_idle_cpu, -1);
|
2014-08-04 17:43:50 -07:00
|
|
|
#endif /* #if defined(CONFIG_TASKS_RCU) && defined(CONFIG_NO_HZ_FULL) */
|
|
|
|
}
|
rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time tasks do not want IPIs sent to CPUs
executing nohz_full userspace tasks. Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode. Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.
Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU. This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task. And that task could switch from one CPU to
another at any time.
This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn. If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.
With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.
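The "zero while the CPU remains in the same dyntick-idle sojourn" check can be illustrated with a simplified user-space model built on C11 atomics. This is a sketch under stated assumptions, not the kernel's rcu_dynticks_zero_in_eqs(): `struct model_cpu` and `model_zero_in_eqs()` are hypothetical, the counter is assumed to be even while the CPU is idle and odd while RCU is watching, and sequentially consistent atomics replace the kernel's explicit memory barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical per-CPU state.  Assumption for this model: the counter
 * is incremented on every idle entry and exit, so it is even while the
 * CPU is in an extended quiescent state and odd otherwise.
 */
struct model_cpu {
	atomic_uint dynticks;
};

/*
 * Return true only if *vp was observed to be zero while the CPU stayed
 * in one and the same idle sojourn for the whole check.
 */
static bool model_zero_in_eqs(struct model_cpu *cpu, atomic_int *vp)
{
	unsigned int snap;

	assert(cpu != NULL && vp != NULL);

	/* Round down to even so a non-idle CPU can never match below. */
	snap = atomic_load(&cpu->dynticks) & ~1u;

	/* A nonzero nesting count means a reader may be active. */
	if (atomic_load(vp) != 0)
		return false;

	/* Any idle exit (or exit plus re-entry) changes the counter. */
	return snap == atomic_load(&cpu->dynticks);
}
```

A false return does not prove that a reader exists. It only means the caller could not confirm a quiescent state this way and has to fall back to another method or retry later, as the commit message describes.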
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-19 15:33:12 -07:00
|
|
|
|
|
|
|
/* Turn on heavyweight RCU tasks trace readers on idle/user entry. */
|
|
|
|
static void rcu_dynticks_task_trace_enter(void)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_TASKS_TRACE_RCU
|
|
|
|
if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
|
|
|
|
current->trc_reader_special.b.need_mb = true;
|
|
|
|
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Turn off heavyweight RCU tasks trace readers on idle/user exit. */
|
|
|
|
static void rcu_dynticks_task_trace_exit(void)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_TASKS_TRACE_RCU
|
|
|
|
if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
|
|
|
|
current->trc_reader_special.b.need_mb = false;
|
|
|
|
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
|
|
|
|
}
|