cgroup: Changes for v6.8

 - Yafang Shao added task_get_cgroup1() helper to enable a similar BPF helper
   so that BPF progs can be more useful on cgroup1 hierarchies. While cgroup1
   is mostly in maintenance mode, this addition is very small while having an
   outsized usefulness for users who are still on cgroup1. Yafang also
   optimized root cgroup list access by making it RCU protected in the
   process.
 
 - Waiman Long optimized rstat operation leading to substantially lower and
   more consistent lock hold time while flushing the hierarchical statistics.
   As the lock can be acquired briefly in various hot paths, this reduction
   has cascading benefits.
 
 - Waiman also improved the quality of isolation for cpuset's isolated
   partitions. CPUs which are allocated to isolated partitions are now
   excluded from running unbound work items and cpu_is_isolated() test which
   is used by vmstat and memcg to reduce interference now includes cpuset
   isolated CPUs. While it isn't there yet, the hope is eventually reaching
   parity with the isolation level provided by the `isolcpus` boot param but
   in a dynamic manner.
 
   This involved a couple workqueue patches which were applied directly to
   cgroup/for-6.8 rather than ping-ponged through the wq tree. This was
   because the wq code change was small and the area is usually very static
   and unlikely to cause conflicts. However, luck had it that there was a wq
   bug fix in the area during the 6.7 cycle which caused a conflict. The
   conflict is contextual but can be a bit confusing to resolve, so there is
   one merge from wq/for-6.7-fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZYnuJg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGQ5kAP9nMMWqi+R1HeG7+hWROTVjQZ0OM9KRcpZ1TmjF
 FNbkJgEAzt+sPnoWwYDTSI7pkNeZ/IM7x1qkkKGvENNtUXrz0Ac=
 =PyYN
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Yafang Shao added task_get_cgroup1() helper to enable a similar BPF
   helper so that BPF progs can be more useful on cgroup1 hierarchies.
   While cgroup1 is mostly in maintenance mode, this addition is very
   small while having an outsized usefulness for users who are still on
   cgroup1. Yafang also optimized root cgroup list access by making it
   RCU protected in the process.

 - Waiman Long optimized rstat operation leading to substantially lower
   and more consistent lock hold time while flushing the hierarchical
   statistics. As the lock can be acquired briefly in various hot paths,
   this reduction has cascading benefits.

 - Waiman also improved the quality of isolation for cpuset's isolated
   partitions. CPUs which are allocated to isolated partitions are now
   excluded from running unbound work items and cpu_is_isolated() test
   which is used by vmstat and memcg to reduce interference now includes
   cpuset isolated CPUs. While it isn't there yet, the hope is
   eventually reaching parity with the isolation level provided by the
   `isolcpus` boot param but in a dynamic manner.

* tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Move rcu_head up near the top of cgroup_root
  cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check
  cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
  cgroup/rstat: Optimize cgroup_rstat_updated_list()
  cgroup: Fix documentation for cpu.idle
  cgroup/cpuset: Expose cpuset.cpus.isolated
  workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS
  cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
  cgroup/cpuset: Keep track of CPUs in isolated partitions
  selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh
  workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
  selftests: cgroup: Fixes a typo in a comment
  cgroup: Add a new helper for cgroup1 hierarchy
  cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root()
  cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()
  cgroup: Make operations on the cgroup root_list RCU safe
  cgroup: Remove unnecessary list_empty()
Linus Torvalds 2024-01-08 20:04:02 -08:00
commit 9f8413c4a6
14 changed files with 708 additions and 283 deletions

View File

@@ -1093,7 +1093,11 @@ All time durations are in microseconds.
 	A read-write single value file which exists on non-root
 	cgroups. The default is "100".

-	The weight in the range [1, 10000].
+	For non idle groups (cpu.idle = 0), the weight is in the
+	range [1, 10000].
+
+	If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
+	then the weight will show as a 0.

   cpu.weight.nice
 	A read-write single value file which exists on non-root
@@ -1157,6 +1161,16 @@ All time durations are in microseconds.
 	values similar to the sched_setattr(2). This maximum utilization
 	value is used to clamp the task specific maximum utilization clamp.

+  cpu.idle
+	A read-write single value file which exists on non-root cgroups.
+	The default is 0.
+
+	This is the cgroup analog of the per-task SCHED_IDLE sched policy.
+	Setting this value to a 1 will make the scheduling policy of the
+	cgroup SCHED_IDLE. The threads inside the cgroup will retain their
+	own relative priorities, but the cgroup itself will be treated as
+	very low priority relative to its peers.
+


 Memory
@@ -2316,6 +2330,13 @@ Cpuset Interface Files
 	treated to have an implicit value of "cpuset.cpus" in the
 	formation of local partition.

+  cpuset.cpus.isolated
+	A read-only and root cgroup only multiple values file.
+
+	This file shows the set of all isolated CPUs used in existing
+	isolated partitions. It will be empty if no isolated partition
+	is created.
+
   cpuset.cpus.partition
 	A read-write single value file which exists on non-root
 	cpuset-enabled cgroups. This flag is owned by the parent cgroup
@@ -2358,11 +2379,11 @@ Cpuset Interface Files
 	partition or scheduling domain. The set of exclusive CPUs is
 	determined by the value of its "cpuset.cpus.exclusive.effective".

-	When set to "isolated", the CPUs in that partition will
-	be in an isolated state without any load balancing from the
-	scheduler. Tasks placed in such a partition with multiple
-	CPUs should be carefully distributed and bound to each of the
-	individual CPUs for optimal performance.
+	When set to "isolated", the CPUs in that partition will be in
+	an isolated state without any load balancing from the scheduler
+	and excluded from the unbound workqueues. Tasks placed in such
+	a partition with multiple CPUs should be carefully distributed
+	and bound to each of the individual CPUs for optimal performance.

 	A partition root ("root" or "isolated") can be in one of the
 	two possible states - valid or invalid. An invalid partition
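
The documentation above describes cpuset.cpus.isolated as a read-only list of isolated CPUs. A userspace consumer would need to parse that list. The sketch below is only an illustration: it assumes the file uses the comma-separated range format common to cpuset CPU lists (for example "2-3,8"), and the helper name `parse_cpulist` and the 64-CPU limit are hypothetical choices, not part of this commit.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the kernel): parse a CPU list such as
 * "2-3,8" into a 64-bit mask. Returns 0 on success, -1 on malformed input
 * or on a CPU number >= 64. An empty string yields an empty mask, matching
 * the "no isolated partition" case described in the documentation.
 */
int parse_cpulist(const char *s, uint64_t *mask)
{
	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtol(s, &end, 10);
		hi = lo;
		if (*end == '-') {
			/* A range "lo-hi": parse the upper bound. */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtol(s, &end, 10);
		}
		if (lo > hi || hi >= 64)
			return -1;
		for (long cpu = lo; cpu <= hi; cpu++)
			*mask |= UINT64_C(1) << cpu;
		s = end;
		if (*s == ',')
			s++;
		else if (*s && *s != '\n')
			return -1;
	}
	return 0;
}
```

A caller would read the file contents into a buffer and pass the buffer to `parse_cpulist()`. A real tool should size the mask from the number of possible CPUs rather than assume 64.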

View File

@@ -496,6 +496,20 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;

+	/*
+	 * Add padding to separate the read mostly rstat_cpu and
+	 * rstat_css_list into a different cacheline from the following
+	 * rstat_flush_next and *bstat fields which can have frequent updates.
+	 */
+	CACHELINE_PADDING(_pad_);
+
+	/*
+	 * A singly-linked list of cgroup structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup *rstat_flush_next;
+
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
@@ -548,6 +562,10 @@ struct cgroup_root {
 	/* Unique id for this hierarchy. */
 	int hierarchy_id;

+	/* A list running through the active hierarchies */
+	struct list_head root_list;
+	struct rcu_head rcu; /* Must be near the top */
+
 	/*
 	 * The root cgroup. The containing cgroup_root will be destroyed on its
 	 * release. cgrp->ancestors[0] will be used overflowing into the
@@ -561,9 +579,6 @@ struct cgroup_root {
 	/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
 	atomic_t nr_cgrps;

-	/* A list running through the active hierarchies */
-	struct list_head root_list;
-
 	/* Hierarchy-specific flags */
 	unsigned int flags;
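
The padding added to struct cgroup keeps read-mostly fields and frequently written fields on different cachelines. The sketch below illustrates the same idea in portable C11. It deliberately uses `_Alignas` rather than the kernel's CACHELINE_PADDING macro, and the structure name, field types, and the 64-byte line size are assumptions made for illustration only.

```c
#include <stddef.h>

#define DEMO_CACHELINE 64	/* assumed line size, for illustration only */

/* Simplified stand-in for the relevant part of struct cgroup. */
struct rstat_demo {
	/* Read-mostly fields. */
	void *rstat_cpu;
	void *rstat_css_list;

	/*
	 * Force the first frequently updated field onto a new cacheline so
	 * that writes to it do not invalidate the read-mostly fields above.
	 */
	_Alignas(DEMO_CACHELINE) void *rstat_flush_next;
	long bstat;
};

/* Index of the cacheline that a given structure offset falls into. */
static size_t demo_line_of(size_t offset)
{
	return offset / DEMO_CACHELINE;
}

/* Returns 1 if the read-mostly and hot fields share a line, 0 otherwise. */
int demo_fields_share_line(void)
{
	return demo_line_of(offsetof(struct rstat_demo, rstat_css_list)) ==
	       demo_line_of(offsetof(struct rstat_demo, rstat_flush_next));
}
```

Because the aligned member raises the alignment of the whole structure, offsets within it map directly onto cacheline boundaries for any properly aligned instance.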

View File

@@ -69,6 +69,7 @@ struct css_task_iter {
 extern struct file_system_type cgroup_fs_type;
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
+extern spinlock_t css_set_lock;

 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys;
 #include <linux/cgroup_subsys.h>
@@ -386,7 +387,6 @@ static inline void cgroup_unlock(void)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c) \
 	rcu_dereference_check((task)->cgroups, \
 		rcu_read_lock_sched_held() || \
@@ -853,4 +853,6 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}

 #endif /* CONFIG_CGROUP_BPF */

+struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id);
+
 #endif /* _LINUX_CGROUP_H */

View File

@@ -77,6 +77,7 @@ extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -207,6 +208,11 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
 	return false;
 }

+static inline bool cpuset_cpu_is_isolated(int cpu)
+{
+	return false;
+}
+
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;

View File

@@ -2,6 +2,7 @@
 #define _LINUX_SCHED_ISOLATION_H

 #include <linux/cpumask.h>
+#include <linux/cpuset.h>
 #include <linux/init.h>
 #include <linux/tick.h>

@@ -67,7 +68,8 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
 static inline bool cpu_is_isolated(int cpu)
 {
 	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
-	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
+	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
+	       cpuset_cpu_is_isolated(cpu);
 }

 #endif /* _LINUX_SCHED_ISOLATION_H */
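
The effect of the extra term in cpu_is_isolated() can be modelled in userspace. The sketch below is an assumption-laden illustration, not kernel code: the three 64-bit masks stand in for the HK_TYPE_DOMAIN and HK_TYPE_TICK housekeeping masks and for cpuset's isolated_cpus, and every name prefixed with `demo_` or `isolation_demo` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the state consulted by cpu_is_isolated(). */
struct isolation_demo {
	uint64_t hk_domain;		/* CPUs kept in scheduler domains */
	uint64_t hk_tick;		/* CPUs that keep the periodic tick */
	uint64_t cpuset_isolated;	/* CPUs in isolated cpuset partitions */
};

/* Test one bit of a mask; cpu must be in the range 0..63. */
static bool demo_test_cpu(uint64_t mask, int cpu)
{
	return (mask >> cpu) & 1;
}

/*
 * Mirrors the shape of the updated check: a CPU counts as isolated if it is
 * missing from either housekeeping mask, or (new in this merge) if it
 * belongs to an isolated cpuset partition.
 */
bool demo_cpu_is_isolated(const struct isolation_demo *s, int cpu)
{
	return !demo_test_cpu(s->hk_domain, cpu) ||
	       !demo_test_cpu(s->hk_tick, cpu) ||
	       demo_test_cpu(s->cpuset_isolated, cpu);
}
```

With all CPUs in both housekeeping masks, a CPU is reported as isolated only once it appears in `cpuset_isolated`, which is the dynamic behaviour the commit message describes.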

View File

@@ -491,7 +491,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void);
 void free_workqueue_attrs(struct workqueue_attrs *attrs);
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs);
-int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);

 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);

View File

@@ -164,13 +164,13 @@ struct cgroup_mgctx {
 #define DEFINE_CGROUP_MGCTX(name) \
 	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)

-extern spinlock_t css_set_lock;
 extern struct cgroup_subsys *cgroup_subsys[];
 extern struct list_head cgroup_roots;

 /* iterate across the hierarchies */
 #define for_each_root(root) \
-	list_for_each_entry((root), &cgroup_roots, root_list)
+	list_for_each_entry_rcu((root), &cgroup_roots, root_list, \
+				lockdep_is_held(&cgroup_mutex))

 /**
  * for_each_subsys - iterate all enabled cgroup subsystems

View File

@@ -1262,6 +1262,40 @@ int cgroup1_get_tree(struct fs_context *fc)
 	return ret;
 }

+/**
+ * task_get_cgroup1 - Acquires the associated cgroup of a task within a
+ * specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its
+ * hierarchy ID.
+ * @tsk: The target task
+ * @hierarchy_id: The ID of a cgroup1 hierarchy
+ *
+ * On success, the cgroup is returned. On failure, ERR_PTR is returned.
+ * We limit it to cgroup1 only.
+ */
+struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id)
+{
+	struct cgroup *cgrp = ERR_PTR(-ENOENT);
+	struct cgroup_root *root;
+	unsigned long flags;
+
+	rcu_read_lock();
+	for_each_root(root) {
+		/* cgroup1 only*/
+		if (root == &cgrp_dfl_root)
+			continue;
+		if (root->hierarchy_id != hierarchy_id)
+			continue;
+		spin_lock_irqsave(&css_set_lock, flags);
+		cgrp = task_cgroup_from_root(tsk, root);
+		if (!cgrp || !cgroup_tryget(cgrp))
+			cgrp = ERR_PTR(-ENOENT);
+		spin_unlock_irqrestore(&css_set_lock, flags);
+		break;
+	}
+	rcu_read_unlock();
+	return cgrp;
+}
+
 static int __init cgroup1_wq_init(void)
 {
 	/*
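
task_get_cgroup1() reports failure through an error pointer rather than NULL, so callers must check for an encoded errno and drop the reference they were given on success. The sketch below models that contract in userspace. It is not the kernel implementation: the `demo_` helpers imitate the ERR_PTR convention, the integer `refcnt` stands in for the reference taken by cgroup_tryget(), and all names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Userspace imitation of the kernel's error-pointer helpers. */
static inline void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical stand-in for a mounted hierarchy. */
struct demo_root {
	int hierarchy_id;
	int refcnt;	/* 0 models a root that is already being torn down */
	int is_default;	/* non-zero models the default (cgroup2) hierarchy */
};

/*
 * Look up a cgroup1-style hierarchy by ID. On success a reference is taken
 * and the root is returned; otherwise an error pointer encoding -ENOENT.
 */
struct demo_root *demo_get_root(struct demo_root *roots, size_t n,
				int hierarchy_id)
{
	for (size_t i = 0; i < n; i++) {
		if (roots[i].is_default)	/* cgroup1 only */
			continue;
		if (roots[i].hierarchy_id != hierarchy_id)
			continue;
		if (roots[i].refcnt == 0)	/* "tryget" failed */
			return demo_err_ptr(-ENOENT);
		roots[i].refcnt++;
		return &roots[i];
	}
	return demo_err_ptr(-ENOENT);
}
```

A caller tests the result with `demo_is_err()` before dereferencing it, which mirrors how a kernel caller would use IS_ERR() on the value returned by task_get_cgroup1().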

View File

@@ -1315,7 +1315,7 @@ static void cgroup_exit_root_id(struct cgroup_root *root)
 void cgroup_free_root(struct cgroup_root *root)
 {
-	kfree(root);
+	kfree_rcu(root, rcu);
 }

 static void cgroup_destroy_root(struct cgroup_root *root)
@@ -1347,10 +1347,9 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 	spin_unlock_irq(&css_set_lock);

-	if (!list_empty(&root->root_list)) {
-		list_del(&root->root_list);
-		cgroup_root_count--;
-	}
+	WARN_ON_ONCE(list_empty(&root->root_list));
+	list_del_rcu(&root->root_list);
+	cgroup_root_count--;

 	if (!have_favordynmods)
 		cgroup_favor_dynmods(root, false);
@@ -1390,7 +1389,15 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
 		}
 	}

-	BUG_ON(!res_cgroup);
+	/*
+	 * If cgroup_mutex is not held, the cgrp_cset_link will be freed
+	 * before we remove the cgroup root from the root_list. Consequently,
+	 * when accessing a cgroup root, the cset_link may have already been
+	 * freed, resulting in a NULL res_cgroup. However, by holding the
+	 * cgroup_mutex, we ensure that res_cgroup can't be NULL.
+	 * If we don't hold cgroup_mutex in the caller, we must do the NULL
+	 * check.
+	 */
 	return res_cgroup;
 }
@@ -1413,6 +1420,11 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
 	rcu_read_unlock();

+	/*
+	 * The namespace_sem is held by current, so the root cgroup can't
+	 * be umounted. Therefore, we can ensure that the res is non-NULL.
+	 */
+	WARN_ON_ONCE(!res);
 	return res;
 }
@@ -1449,7 +1461,6 @@ static struct cgroup *current_cgns_cgroup_dfl(void)
 static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 					    struct cgroup_root *root)
 {
-	lockdep_assert_held(&cgroup_mutex);
 	lockdep_assert_held(&css_set_lock);

 	return __cset_cgroup_from_root(cset, root);
@@ -1457,7 +1468,9 @@ static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 /*
  * Return the cgroup for "task" from the given hierarchy. Must be
- * called with cgroup_mutex and css_set_lock held.
+ * called with css_set_lock held to prevent task's groups from being modified.
+ * Must be called with either cgroup_mutex or rcu read lock to prevent the
+ * cgroup root from being destroyed.
  */
 struct cgroup *task_cgroup_from_root(struct task_struct *task,
 				     struct cgroup_root *root)
@@ -2032,7 +2045,7 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
 	struct cgroup_root *root = ctx->root;
 	struct cgroup *cgrp = &root->cgrp;

-	INIT_LIST_HEAD(&root->root_list);
+	INIT_LIST_HEAD_RCU(&root->root_list);
 	atomic_set(&root->nr_cgrps, 1);
 	cgrp->root = root;
 	init_cgroup_housekeeping(cgrp);
@@ -2115,7 +2128,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	 * care of subsystems' refcounts, which are explicitly dropped in
 	 * the failure exit path.
 	 */
-	list_add(&root->root_list, &cgroup_roots);
+	list_add_rcu(&root->root_list, &cgroup_roots);
 	cgroup_root_count++;

 	/*
@@ -6265,7 +6278,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 	if (!buf)
 		goto out;

-	cgroup_lock();
+	rcu_read_lock();
 	spin_lock_irq(&css_set_lock);

 	for_each_root(root) {
@@ -6276,6 +6289,11 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		if (root == &cgrp_dfl_root && !READ_ONCE(cgrp_dfl_visible))
 			continue;

+		cgrp = task_cgroup_from_root(tsk, root);
+		/* The root has already been unmounted. */
+		if (!cgrp)
+			continue;
+
 		seq_printf(m, "%d:", root->hierarchy_id);
 		if (root != &cgrp_dfl_root)
 			for_each_subsys(ss, ssid)
@@ -6286,9 +6304,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 			seq_printf(m, "%sname=%s", count ? "," : "",
 				   root->name);
 		seq_putc(m, ':');
-
-		cgrp = task_cgroup_from_root(tsk, root);
-
 		/*
 		 * On traditional hierarchies, all zombie tasks show up as
 		 * belonging to the root cgroup. On the default hierarchy,
@@ -6320,7 +6335,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 	retval = 0;
 out_unlock:
 	spin_unlock_irq(&css_set_lock);
-	cgroup_unlock();
+	rcu_read_unlock();
 	kfree(buf);
 out:
 	return retval;

View File

@@ -25,6 +25,7 @@
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
 #include <linux/cpuset.h>
+#include <linux/delay.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
@@ -43,6 +44,7 @@
 #include <linux/sched/isolation.h>
 #include <linux/cgroup.h>
 #include <linux/wait.h>
+#include <linux/workqueue.h>

 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -204,6 +206,11 @@ struct cpuset {
  */
 static cpumask_var_t subpartitions_cpus;

+/*
+ * Exclusive CPUs in isolated partitions
+ */
+static cpumask_var_t isolated_cpus;
+
 /* List of remote partition root children */
 static struct list_head remote_children;
@@ -1317,6 +1324,7 @@ static void compute_effective_cpumask(struct cpumask *new_cpus,
  */
 enum partition_cmd {
 	partcmd_enable, /* Enable partition root */
+	partcmd_enablei, /* Enable isolated partition root */
 	partcmd_disable, /* Disable partition root */
 	partcmd_update, /* Update parent's effective_cpus */
 	partcmd_invalidate, /* Make partition invalid */
@@ -1418,6 +1426,109 @@ static void reset_partition_data(struct cpuset *cs)
 	}
 }

+/*
+ * partition_xcpus_newstate - Exclusive CPUs state change
+ * @old_prs: old partition_root_state
+ * @new_prs: new partition_root_state
+ * @xcpus: exclusive CPUs with state change
+ */
+static void partition_xcpus_newstate(int old_prs, int new_prs, struct cpumask *xcpus)
+{
+	WARN_ON_ONCE(old_prs == new_prs);
+	if (new_prs == PRS_ISOLATED)
+		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
+	else
+		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
+}
+
+/*
+ * partition_xcpus_add - Add new exclusive CPUs to partition
+ * @new_prs: new partition_root_state
+ * @parent: parent cpuset
+ * @xcpus: exclusive CPUs to be added
+ * Return: true if isolated_cpus modified, false otherwise
+ *
+ * Remote partition if parent == NULL
+ */
+static bool partition_xcpus_add(int new_prs, struct cpuset *parent,
+				struct cpumask *xcpus)
+{
+	bool isolcpus_updated;
+	WARN_ON_ONCE(new_prs < 0);
+	lockdep_assert_held(&callback_lock);
+	if (!parent)
+		parent = &top_cpuset;
+	if (parent == &top_cpuset)
+		cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);
+	isolcpus_updated = (new_prs != parent->partition_root_state);
+	if (isolcpus_updated)
+		partition_xcpus_newstate(parent->partition_root_state, new_prs,
+					 xcpus);
+	cpumask_andnot(parent->effective_cpus, parent->effective_cpus, xcpus);
+	return isolcpus_updated;
+}
+
+/*
+ * partition_xcpus_del - Remove exclusive CPUs from partition
+ * @old_prs: old partition_root_state
+ * @parent: parent cpuset
+ * @xcpus: exclusive CPUs to be removed
+ * Return: true if isolated_cpus modified, false otherwise
+ *
+ * Remote partition if parent == NULL
+ */
+static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
+				struct cpumask *xcpus)
+{
+	bool isolcpus_updated;
+	WARN_ON_ONCE(old_prs < 0);
+	lockdep_assert_held(&callback_lock);
+	if (!parent)
+		parent = &top_cpuset;
+	if (parent == &top_cpuset)
+		cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);
+	isolcpus_updated = (old_prs != parent->partition_root_state);
+	if (isolcpus_updated)
+		partition_xcpus_newstate(old_prs, parent->partition_root_state,
+					 xcpus);
+	cpumask_and(xcpus, xcpus, cpu_active_mask);
+	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
+	return isolcpus_updated;
+}
+
+static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+{
+	int ret;
+	lockdep_assert_cpus_held();
+	if (!isolcpus_updated)
+		return;
+	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
+	WARN_ON_ONCE(ret < 0);
+}
+
+/**
+ * cpuset_cpu_is_isolated - Check if the given CPU is isolated
+ * @cpu: the CPU number to be checked
+ * Return: true if CPU is used in an isolated partition, false otherwise
+ */
+bool cpuset_cpu_is_isolated(int cpu)
+{
+	return cpumask_test_cpu(cpu, isolated_cpus);
+}
+EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
+
 /*
  * compute_effective_exclusive_cpumask - compute effective exclusive CPUs
  * @cs: cpuset
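
The bookkeeping done by partition_xcpus_add() and partition_xcpus_del() can be followed more easily with plain bitmasks. The sketch below is a simplified userspace model under stated assumptions: cpumasks become 64-bit integers, the subpartitions_cpus update and all locking are omitted, and the `DEMO_PRS_*` values and `demo_` names are hypothetical stand-ins rather than kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the partition root states. */
enum { DEMO_PRS_MEMBER = 0, DEMO_PRS_ROOT = 1, DEMO_PRS_ISOLATED = 2 };

struct demo_cpuset_state {
	uint64_t parent_effective;	/* CPUs the parent can still use */
	uint64_t isolated;		/* CPUs in isolated partitions */
	int parent_prs;			/* partition state of the parent */
};

/* Models partition_xcpus_newstate(): set or clear bits in the isolated mask. */
static void demo_xcpus_newstate(struct demo_cpuset_state *s, int new_prs,
				uint64_t xcpus)
{
	if (new_prs == DEMO_PRS_ISOLATED)
		s->isolated |= xcpus;
	else
		s->isolated &= ~xcpus;
}

/* Hand xcpus to a child partition; true if the isolated mask changed. */
bool demo_xcpus_add(struct demo_cpuset_state *s, int new_prs, uint64_t xcpus)
{
	bool updated = (new_prs != s->parent_prs);

	if (updated)
		demo_xcpus_newstate(s, new_prs, xcpus);
	s->parent_effective &= ~xcpus;
	return updated;
}

/* Return xcpus to the parent; only currently active CPUs become usable. */
bool demo_xcpus_del(struct demo_cpuset_state *s, int old_prs, uint64_t xcpus,
		    uint64_t active)
{
	bool updated = (old_prs != s->parent_prs);

	if (updated)
		demo_xcpus_newstate(s, s->parent_prs, xcpus);
	s->parent_effective |= xcpus & active;
	return updated;
}
```

In the kernel code above, a true return value is what leads the caller to invoke update_unbound_workqueue_cpumask(), so the unbound workqueue mask is only recomputed when the set of isolated CPUs actually changes.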
@@ -1456,14 +1567,18 @@ static inline bool is_local_partition(struct cpuset *cs)
 /*
  * remote_partition_enable - Enable current cpuset as a remote partition root
  * @cs: the cpuset to update
+ * @new_prs: new partition_root_state
  * @tmp: temparary masks
  * Return: 1 if successful, 0 if error
  *
  * Enable the current cpuset to become a remote partition root taking CPUs
  * directly from the top cpuset. cpuset_mutex must be held by the caller.
  */
-static int remote_partition_enable(struct cpuset *cs, struct tmpmasks *tmp)
+static int remote_partition_enable(struct cpuset *cs, int new_prs,
+				   struct tmpmasks *tmp)
 {
+	bool isolcpus_updated;
+
 	/*
 	 * The user must have sysadmin privilege.
 	 */
@@ -1485,26 +1600,22 @@ static int remote_partition_enable(struct cpuset *cs, struct tmpmasks *tmp)
 		return 0;

 	spin_lock_irq(&callback_lock);
-	cpumask_andnot(top_cpuset.effective_cpus,
-		       top_cpuset.effective_cpus, tmp->new_cpus);
-	cpumask_or(subpartitions_cpus,
-		   subpartitions_cpus, tmp->new_cpus);
-
+	isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
+	list_add(&cs->remote_sibling, &remote_children);
 	if (cs->use_parent_ecpus) {
 		struct cpuset *parent = parent_cs(cs);

 		cs->use_parent_ecpus = false;
 		parent->child_ecpus_count--;
 	}
-	list_add(&cs->remote_sibling, &remote_children);
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(isolcpus_updated);

 	/*
 	 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy.
 	 */
 	update_tasks_cpumask(&top_cpuset, tmp->new_cpus);
 	update_sibling_cpumasks(&top_cpuset, NULL, tmp);
-
 	return 1;
 }
@@ -1519,23 +1630,22 @@ static int remote_partition_enable(struct cpuset *cs, struct tmpmasks *tmp)
  */
 static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 {
+	bool isolcpus_updated;
+
 	compute_effective_exclusive_cpumask(cs, tmp->new_cpus);
 	WARN_ON_ONCE(!is_remote_partition(cs));
 	WARN_ON_ONCE(!cpumask_subset(tmp->new_cpus, subpartitions_cpus));

 	spin_lock_irq(&callback_lock);
-	cpumask_andnot(subpartitions_cpus,
-		       subpartitions_cpus, tmp->new_cpus);
-	cpumask_and(tmp->new_cpus,
-		    tmp->new_cpus, cpu_active_mask);
-	cpumask_or(top_cpuset.effective_cpus,
-		   top_cpuset.effective_cpus, tmp->new_cpus);
 	list_del_init(&cs->remote_sibling);
+	isolcpus_updated = partition_xcpus_del(cs->partition_root_state,
+					       NULL, tmp->new_cpus);
 	cs->partition_root_state = -cs->partition_root_state;
 	if (!cs->prs_err)
 		cs->prs_err = PERR_INVCPUS;
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(isolcpus_updated);

 	/*
 	 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy.
@@ -1557,6 +1667,8 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *newmask,
 			       struct tmpmasks *tmp)
 {
 	bool adding, deleting;
+	int prs = cs->partition_root_state;
+	int isolcpus_updated = 0;

 	if (WARN_ON_ONCE(!is_remote_partition(cs)))
 		return;
@@ -1580,21 +1692,12 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *newmask,
 		goto invalidate;

 	spin_lock_irq(&callback_lock);
-	if (adding) {
-		cpumask_or(subpartitions_cpus,
-			   subpartitions_cpus, tmp->addmask);
-		cpumask_andnot(top_cpuset.effective_cpus,
-			       top_cpuset.effective_cpus, tmp->addmask);
-	}
-	if (deleting) {
-		cpumask_andnot(subpartitions_cpus,
-			       subpartitions_cpus, tmp->delmask);
-		cpumask_and(tmp->delmask,
-			    tmp->delmask, cpu_active_mask);
-		cpumask_or(top_cpuset.effective_cpus,
-			   top_cpuset.effective_cpus, tmp->delmask);
-	}
+	if (adding)
+		isolcpus_updated += partition_xcpus_add(prs, NULL, tmp->addmask);
+	if (deleting)
+		isolcpus_updated += partition_xcpus_del(prs, NULL, tmp->delmask);
 	spin_unlock_irq(&callback_lock);
+	update_unbound_workqueue_cpumask(isolcpus_updated);

 	/*
 	 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy.
* @tmp: Temporary addmask and delmask * @tmp: Temporary addmask and delmask
* Return: 0 or a partition root state error code * Return: 0 or a partition root state error code
* *
* For partcmd_enable, the cpuset is being transformed from a non-partition * For partcmd_enable*, the cpuset is being transformed from a non-partition
* root to a partition root. The effective_xcpus (cpus_allowed if effective_xcpus * root to a partition root. The effective_xcpus (cpus_allowed if
* not set) mask of the given cpuset will be taken away from parent's * effective_xcpus not set) mask of the given cpuset will be taken away from
* effective_cpus. The function will return 0 if all the CPUs listed in * parent's effective_cpus. The function will return 0 if all the CPUs listed
* effective_xcpus can be granted or an error code will be returned. * in effective_xcpus can be granted or an error code will be returned.
* *
* For partcmd_disable, the cpuset is being transformed from a partition * For partcmd_disable, the cpuset is being transformed from a partition
* root back to a non-partition root. Any CPUs in effective_xcpus will be * root back to a non-partition root. Any CPUs in effective_xcpus will be
@ -1695,7 +1798,7 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
* *
* For partcmd_invalidate, the current partition will be made invalid. * For partcmd_invalidate, the current partition will be made invalid.
* *
* The partcmd_enable and partcmd_disable commands are used by * The partcmd_enable* and partcmd_disable commands are used by
* update_prstate(). An error code may be returned and the caller will check * update_prstate(). An error code may be returned and the caller will check
* for error. * for error.
* *
@ -1716,6 +1819,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
int part_error = PERR_NONE; /* Partition error? */ int part_error = PERR_NONE; /* Partition error? */
int subparts_delta = 0; int subparts_delta = 0;
struct cpumask *xcpus; /* cs effective_xcpus */ struct cpumask *xcpus; /* cs effective_xcpus */
int isolcpus_updated = 0;
bool nocpu; bool nocpu;
lockdep_assert_held(&cpuset_mutex); lockdep_assert_held(&cpuset_mutex);
@ -1760,7 +1864,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
nocpu = tasks_nocpu_error(parent, cs, xcpus); nocpu = tasks_nocpu_error(parent, cs, xcpus);
if (cmd == partcmd_enable) { if ((cmd == partcmd_enable) || (cmd == partcmd_enablei)) {
/* /*
* Enabling partition root is not allowed if its * Enabling partition root is not allowed if its
* effective_xcpus is empty or doesn't overlap with * effective_xcpus is empty or doesn't overlap with
@ -1783,6 +1887,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
cpumask_copy(tmp->delmask, xcpus); cpumask_copy(tmp->delmask, xcpus);
deleting = true; deleting = true;
subparts_delta++; subparts_delta++;
new_prs = (cmd == partcmd_enable) ? PRS_ROOT : PRS_ISOLATED;
} else if (cmd == partcmd_disable) { } else if (cmd == partcmd_disable) {
/* /*
* May need to add cpus to parent's effective_cpus for * May need to add cpus to parent's effective_cpus for
@ -1792,6 +1897,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus); cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus);
if (adding) if (adding)
subparts_delta--; subparts_delta--;
new_prs = PRS_MEMBER;
} else if (newmask) { } else if (newmask) {
/* /*
* Empty cpumask is not allowed * Empty cpumask is not allowed
@ -1940,38 +2046,28 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
* newly deleted ones will be added back to effective_cpus. * newly deleted ones will be added back to effective_cpus.
*/ */
spin_lock_irq(&callback_lock); spin_lock_irq(&callback_lock);
if (adding) {
if (parent == &top_cpuset)
cpumask_andnot(subpartitions_cpus,
subpartitions_cpus, tmp->addmask);
/*
* Some of the CPUs in effective_xcpus might have been offlined.
*/
cpumask_or(parent->effective_cpus,
parent->effective_cpus, tmp->addmask);
cpumask_and(parent->effective_cpus,
parent->effective_cpus, cpu_active_mask);
}
if (deleting) {
if (parent == &top_cpuset)
cpumask_or(subpartitions_cpus,
subpartitions_cpus, tmp->delmask);
cpumask_andnot(parent->effective_cpus,
parent->effective_cpus, tmp->delmask);
}
if (is_partition_valid(parent)) {
parent->nr_subparts += subparts_delta;
WARN_ON_ONCE(parent->nr_subparts < 0);
}
if (old_prs != new_prs) { if (old_prs != new_prs) {
cs->partition_root_state = new_prs; cs->partition_root_state = new_prs;
if (new_prs <= 0) if (new_prs <= 0)
cs->nr_subparts = 0; cs->nr_subparts = 0;
} }
/*
* Adding to parent's effective_cpus means deleting CPUs from cs
* and vice versa.
*/
if (adding)
isolcpus_updated += partition_xcpus_del(old_prs, parent,
tmp->addmask);
if (deleting)
isolcpus_updated += partition_xcpus_add(new_prs, parent,
tmp->delmask);
if (is_partition_valid(parent)) {
parent->nr_subparts += subparts_delta;
WARN_ON_ONCE(parent->nr_subparts < 0);
}
spin_unlock_irq(&callback_lock); spin_unlock_irq(&callback_lock);
update_unbound_workqueue_cpumask(isolcpus_updated);
if ((old_prs != new_prs) && (cmd == partcmd_update)) if ((old_prs != new_prs) && (cmd == partcmd_update))
update_partition_exclusive(cs, new_prs); update_partition_exclusive(cs, new_prs);
@ -2948,6 +3044,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
int err = PERR_NONE, old_prs = cs->partition_root_state; int err = PERR_NONE, old_prs = cs->partition_root_state;
struct cpuset *parent = parent_cs(cs); struct cpuset *parent = parent_cs(cs);
struct tmpmasks tmpmask; struct tmpmasks tmpmask;
bool new_xcpus_state = false;
if (old_prs == new_prs) if (old_prs == new_prs)
return 0; return 0;
@ -2977,6 +3074,9 @@ static int update_prstate(struct cpuset *cs, int new_prs)
goto out; goto out;
if (!old_prs) { if (!old_prs) {
enum partition_cmd cmd = (new_prs == PRS_ROOT)
? partcmd_enable : partcmd_enablei;
/* /*
* cpus_allowed cannot be empty. * cpus_allowed cannot be empty.
*/ */
@ -2985,19 +3085,18 @@ static int update_prstate(struct cpuset *cs, int new_prs)
goto out; goto out;
} }
err = update_parent_effective_cpumask(cs, partcmd_enable, err = update_parent_effective_cpumask(cs, cmd, NULL, &tmpmask);
NULL, &tmpmask);
/* /*
* If an attempt to become local partition root fails, * If an attempt to become local partition root fails,
* try to become a remote partition root instead. * try to become a remote partition root instead.
*/ */
if (err && remote_partition_enable(cs, &tmpmask)) if (err && remote_partition_enable(cs, new_prs, &tmpmask))
err = 0; err = 0;
} else if (old_prs && new_prs) { } else if (old_prs && new_prs) {
/* /*
* A change in load balance state only, no change in cpumasks. * A change in load balance state only, no change in cpumasks.
*/ */
; new_xcpus_state = true;
} else { } else {
/* /*
* Switching back to member is always allowed even if it * Switching back to member is always allowed even if it
@ -3029,7 +3128,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
WRITE_ONCE(cs->prs_err, err); WRITE_ONCE(cs->prs_err, err);
if (!is_partition_valid(cs)) if (!is_partition_valid(cs))
reset_partition_data(cs); reset_partition_data(cs);
else if (new_xcpus_state)
partition_xcpus_newstate(old_prs, new_prs, cs->effective_xcpus);
spin_unlock_irq(&callback_lock); spin_unlock_irq(&callback_lock);
update_unbound_workqueue_cpumask(new_xcpus_state);
/* Force update if switching back to member */ /* Force update if switching back to member */
update_cpumasks_hier(cs, &tmpmask, !new_prs ? HIER_CHECKALL : 0); update_cpumasks_hier(cs, &tmpmask, !new_prs ? HIER_CHECKALL : 0);
@ -3386,6 +3488,7 @@ typedef enum {
FILE_SUBPARTS_CPULIST, FILE_SUBPARTS_CPULIST,
FILE_EXCLUSIVE_CPULIST, FILE_EXCLUSIVE_CPULIST,
FILE_EFFECTIVE_XCPULIST, FILE_EFFECTIVE_XCPULIST,
FILE_ISOLATED_CPULIST,
FILE_CPU_EXCLUSIVE, FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE, FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL, FILE_MEM_HARDWALL,
@ -3582,6 +3685,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
case FILE_SUBPARTS_CPULIST: case FILE_SUBPARTS_CPULIST:
seq_printf(sf, "%*pbl\n", cpumask_pr_args(subpartitions_cpus)); seq_printf(sf, "%*pbl\n", cpumask_pr_args(subpartitions_cpus));
break; break;
case FILE_ISOLATED_CPULIST:
seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
break;
default: default:
ret = -EINVAL; ret = -EINVAL;
} }
@ -3875,6 +3981,13 @@ static struct cftype dfl_files[] = {
.flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG, .flags = CFTYPE_ONLY_ON_ROOT | CFTYPE_DEBUG,
}, },
{
.name = "cpus.isolated",
.seq_show = cpuset_common_seq_show,
.private = FILE_ISOLATED_CPULIST,
.flags = CFTYPE_ONLY_ON_ROOT,
},
{ } /* terminate */ { } /* terminate */
}; };
@ -4194,6 +4307,7 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_xcpus, GFP_KERNEL)); BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_xcpus, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL)); BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
cpumask_setall(top_cpuset.cpus_allowed); cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed); nodes_setall(top_cpuset.mems_allowed);
@ -4306,6 +4420,30 @@ void cpuset_force_rebuild(void)
force_rebuild = true; force_rebuild = true;
} }
/*
* Attempt to acquire a cpus_read_lock while a hotplug operation may be in
* progress.
* Return: true if successful, false otherwise
*
* To avoid circular lock dependency between cpuset_mutex and cpus_read_lock,
* cpus_read_trylock() is used here to acquire the lock.
*/
static bool cpuset_hotplug_cpus_read_trylock(void)
{
int retries = 0;
while (!cpus_read_trylock()) {
/*
* CPU hotplug still in progress. Retry 5 times
* with a 10ms wait before bailing out.
*/
if (++retries > 5)
return false;
msleep(10);
}
return true;
}
/** /**
* cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
* @cs: cpuset in interest * @cs: cpuset in interest
@ -4322,6 +4460,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
bool cpus_updated; bool cpus_updated;
bool mems_updated; bool mems_updated;
bool remote; bool remote;
int partcmd = -1;
struct cpuset *parent; struct cpuset *parent;
retry: retry:
wait_event(cpuset_attach_wq, cs->attach_in_progress == 0); wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
@ -4353,11 +4492,13 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
compute_partition_effective_cpumask(cs, &new_cpus); compute_partition_effective_cpumask(cs, &new_cpus);
if (remote && cpumask_empty(&new_cpus) && if (remote && cpumask_empty(&new_cpus) &&
partition_is_populated(cs, NULL)) { partition_is_populated(cs, NULL) &&
cpuset_hotplug_cpus_read_trylock()) {
remote_partition_disable(cs, tmp); remote_partition_disable(cs, tmp);
compute_effective_cpumask(&new_cpus, cs, parent); compute_effective_cpumask(&new_cpus, cs, parent);
remote = false; remote = false;
cpuset_force_rebuild(); cpuset_force_rebuild();
cpus_read_unlock();
} }
/* /*
@ -4368,18 +4509,28 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
* partitions. * partitions.
*/ */
if (is_local_partition(cs) && (!is_partition_valid(parent) || if (is_local_partition(cs) && (!is_partition_valid(parent) ||
tasks_nocpu_error(parent, cs, &new_cpus))) { tasks_nocpu_error(parent, cs, &new_cpus)))
update_parent_effective_cpumask(cs, partcmd_invalidate, NULL, tmp); partcmd = partcmd_invalidate;
compute_effective_cpumask(&new_cpus, cs, parent);
cpuset_force_rebuild();
}
/* /*
* On the other hand, an invalid partition root may be transitioned * On the other hand, an invalid partition root may be transitioned
* back to a regular one. * back to a regular one.
*/ */
else if (is_partition_valid(parent) && is_partition_invalid(cs)) { else if (is_partition_valid(parent) && is_partition_invalid(cs))
update_parent_effective_cpumask(cs, partcmd_update, NULL, tmp); partcmd = partcmd_update;
if (is_partition_valid(cs)) {
/*
* cpus_read_lock needs to be held before calling
* update_parent_effective_cpumask(). To avoid circular lock
* dependency between cpuset_mutex and cpus_read_lock,
* cpus_read_trylock() is used here to acquire the lock.
*/
if (partcmd >= 0) {
if (!cpuset_hotplug_cpus_read_trylock())
goto update_tasks;
update_parent_effective_cpumask(cs, partcmd, NULL, tmp);
cpus_read_unlock();
if ((partcmd == partcmd_invalidate) || is_partition_valid(cs)) {
compute_partition_effective_cpumask(cs, &new_cpus); compute_partition_effective_cpumask(cs, &new_cpus);
cpuset_force_rebuild(); cpuset_force_rebuild();
} }
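The bounded retry in cpuset_hotplug_cpus_read_trylock() above can be sketched in user space as follows. This is only an illustration under stated assumptions: bounded_trylock(), trylock_fn, struct fake_lock and fake_trylock() are hypothetical names that do not exist in the kernel, the callback stands in for cpus_read_trylock(), and nanosleep() stands in for msleep().

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical user-space analogue of cpuset_hotplug_cpus_read_trylock().
 * try_fn and ctx stand in for cpus_read_trylock(); they are not kernel APIs.
 */
typedef bool (*trylock_fn)(void *ctx);

static bool bounded_trylock(trylock_fn try_fn, void *ctx,
			    int max_retries, long sleep_ms)
{
	int retries = 0;

	while (!try_fn(ctx)) {
		struct timespec ts;

		/* Give up once the retry budget is exhausted. */
		if (++retries > max_retries)
			return false;

		/* Back off before the next attempt (msleep() in the kernel). */
		ts.tv_sec = sleep_ms / 1000;
		ts.tv_nsec = (sleep_ms % 1000) * 1000000L;
		nanosleep(&ts, NULL);
	}
	return true;
}

/* Illustrative lock that fails a fixed number of times, counting attempts. */
struct fake_lock {
	int fail_count;	/* number of attempts that should fail */
	int calls;	/* attempts observed so far */
};

static bool fake_trylock(void *ctx)
{
	struct fake_lock *l = ctx;

	l->calls++;
	return l->calls > l->fail_count;
}
```

With max_retries = 5, as in the patch, a lock that never becomes available is attempted six times in total (the initial attempt plus five retries) before the caller falls through to its fallback path.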

@ -74,64 +74,109 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
} }
/** /**
* cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree * cgroup_rstat_push_children - push children cgroups into the given list
* @pos: current position * @head: current head of the list (= subtree root)
* @root: root of the tree to traversal * @child: first child of the root
* @cpu: target cpu * @cpu: target cpu
* Return: A new singly linked list of cgroups to be flushed
* *
* Walks the updated rstat_cpu tree on @cpu from @root. %NULL @pos starts * Iteratively traverse down the cgroup_rstat_cpu updated tree level by
* the traversal and %NULL return indicates the end. During traversal, * level and push all the parents first before their next level children
* each returned cgroup is unlinked from the tree. Must be called with the * into a singly linked list built from the tail backward like "pushing"
* matching cgroup_rstat_cpu_lock held. * cgroups into a stack. The root is pushed by the caller.
*/
static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
struct cgroup *child, int cpu)
{
struct cgroup *chead = child; /* Head of child cgroup level */
struct cgroup *ghead = NULL; /* Head of grandchild cgroup level */
struct cgroup *parent, *grandchild;
struct cgroup_rstat_cpu *crstatc;
child->rstat_flush_next = NULL;
next_level:
while (chead) {
child = chead;
chead = child->rstat_flush_next;
parent = cgroup_parent(child);
/* updated_next is parent cgroup terminated */
while (child != parent) {
child->rstat_flush_next = head;
head = child;
crstatc = cgroup_rstat_cpu(child, cpu);
grandchild = crstatc->updated_children;
if (grandchild != child) {
/* Push the grand child to the next level */
crstatc->updated_children = child;
grandchild->rstat_flush_next = ghead;
ghead = grandchild;
}
child = crstatc->updated_next;
crstatc->updated_next = NULL;
}
}
if (ghead) {
chead = ghead;
ghead = NULL;
goto next_level;
}
return head;
}
/**
* cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
* @root: root of the cgroup subtree to traverse
* @cpu: target cpu
* Return: A singly linked list of cgroups to be flushed
*
* Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
* each returned cgroup is unlinked from the updated tree.
* *
* The only ordering guarantee is that, for a parent and a child pair * The only ordering guarantee is that, for a parent and a child pair
* covered by a given traversal, if a child is visited, its parent is * covered by a given traversal, the child is before its parent in
* guaranteed to be visited afterwards. * the list.
*
* Note that updated_children is self terminated and points to a list of
* child cgroups if not empty. Whereas updated_next is like a sibling link
* within the children list and terminated by the parent cgroup. An exception
* here is the cgroup root whose updated_next can be self terminated.
*/ */
static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
struct cgroup *root, int cpu)
{ {
struct cgroup_rstat_cpu *rstatc; raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
struct cgroup *parent; struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
struct cgroup *head = NULL, *parent, *child;
if (pos == root) unsigned long flags;
return NULL;
/* /*
* We're gonna walk down to the first leaf and visit/remove it. We * The _irqsave() is needed because cgroup_rstat_lock is
* can pick whatever unvisited node as the starting point. * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
* this lock with the _irq() suffix only disables interrupts on
* a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
* interrupts on both configurations. The _irqsave() ensures
* that interrupts are always disabled and later restored.
*/ */
if (!pos) { raw_spin_lock_irqsave(cpu_lock, flags);
pos = root;
/* return NULL if this subtree is not on-list */
if (!cgroup_rstat_cpu(pos, cpu)->updated_next)
return NULL;
} else {
pos = cgroup_parent(pos);
}
/* walk down to the first leaf */ /* Return NULL if this subtree is not on-list */
while (true) { if (!rstatc->updated_next)
rstatc = cgroup_rstat_cpu(pos, cpu); goto unlock_ret;
if (rstatc->updated_children == pos)
break;
pos = rstatc->updated_children;
}
/* /*
* Unlink @pos from the tree. As the updated_children list is * Unlink @root from its parent. As the updated_children list is
* singly linked, we have to walk it to find the removal point. * singly linked, we have to walk it to find the removal point.
* However, due to the way we traverse, @pos will be the first
* child in most cases. The only exception is @root.
*/ */
parent = cgroup_parent(pos); parent = cgroup_parent(root);
if (parent) { if (parent) {
struct cgroup_rstat_cpu *prstatc; struct cgroup_rstat_cpu *prstatc;
struct cgroup **nextp; struct cgroup **nextp;
prstatc = cgroup_rstat_cpu(parent, cpu); prstatc = cgroup_rstat_cpu(parent, cpu);
nextp = &prstatc->updated_children; nextp = &prstatc->updated_children;
while (*nextp != pos) { while (*nextp != root) {
struct cgroup_rstat_cpu *nrstatc; struct cgroup_rstat_cpu *nrstatc;
nrstatc = cgroup_rstat_cpu(*nextp, cpu); nrstatc = cgroup_rstat_cpu(*nextp, cpu);
@ -142,7 +187,17 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
} }
rstatc->updated_next = NULL; rstatc->updated_next = NULL;
return pos;
/* Push @root to the list first before pushing the children */
head = root;
root->rstat_flush_next = NULL;
child = rstatc->updated_children;
rstatc->updated_children = root;
if (child != root)
head = cgroup_rstat_push_children(head, child, cpu);
unlock_ret:
raw_spin_unlock_irqrestore(cpu_lock, flags);
return head;
} }
/* /*
@ -176,21 +231,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
lockdep_assert_held(&cgroup_rstat_lock); lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) { for_each_possible_cpu(cpu) {
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
cpu);
struct cgroup *pos = NULL;
unsigned long flags;
/* for (; pos; pos = pos->rstat_flush_next) {
* The _irqsave() is needed because cgroup_rstat_lock is
* spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
* this lock with the _irq() suffix only disables interrupts on
* a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
* interrupts on both configurations. The _irqsave() ensures
* that interrupts are always disabled and later restored.
*/
raw_spin_lock_irqsave(cpu_lock, flags);
while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
struct cgroup_subsys_state *css; struct cgroup_subsys_state *css;
cgroup_base_stat_flush(pos, cpu); cgroup_base_stat_flush(pos, cpu);
@ -202,7 +245,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
css->ss->css_rstat_flush(css, cpu); css->ss->css_rstat_flush(css, cpu);
rcu_read_unlock(); rcu_read_unlock();
} }
raw_spin_unlock_irqrestore(cpu_lock, flags);
/* play nice and yield if necessary */ /* play nice and yield if necessary */
if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) { if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
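The level-by-level push performed by cgroup_rstat_push_children() and cgroup_rstat_updated_list() can be modelled in user space roughly as below. This is a simplified sketch, not the kernel code: struct node, node_init(), mark_updated(), push_children(), updated_list() and list_pos() are hypothetical names, there is a single "CPU", no locking, and the flush is assumed to start at the top of the tree so the step that unlinks @root from its parent is omitted.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup plus its per-cpu rstat state. */
struct node {
	struct node *parent;
	struct node *updated_children;	/* self terminated when empty */
	struct node *updated_next;	/* sibling link, parent terminated */
	struct node *flush_next;	/* link in the list handed to the flusher */
};

static void node_init(struct node *n, struct node *parent)
{
	n->parent = parent;
	n->updated_children = n;
	n->updated_next = NULL;
	n->flush_next = NULL;
}

/* Put @n and its not-yet-listed ancestors on the updated tree. */
static void mark_updated(struct node *n)
{
	while (n && !n->updated_next) {
		struct node *p = n->parent;

		if (!p) {
			n->updated_next = n;	/* the root is self terminated */
			break;
		}
		n->updated_next = p->updated_children;
		p->updated_children = n;
		n = p;
	}
}

/* Push one level at a time so deeper nodes end up nearer the list head. */
static struct node *push_children(struct node *head, struct node *child)
{
	struct node *chead = child;	/* first children of the current level */
	struct node *ghead = NULL;	/* first children of the next level */

	child->flush_next = NULL;
	while (chead) {
		while (chead) {
			struct node *parent;

			child = chead;
			chead = child->flush_next;
			parent = child->parent;
			/* Walk the sibling chain until it reaches the parent. */
			while (child != parent) {
				struct node *next = child->updated_next;
				struct node *grandchild = child->updated_children;

				child->flush_next = head;
				head = child;
				if (grandchild != child) {
					/* Defer the grandchildren to the next level. */
					child->updated_children = child;
					grandchild->flush_next = ghead;
					ghead = grandchild;
				}
				child->updated_next = NULL;
				child = next;
			}
		}
		chead = ghead;
		ghead = NULL;
	}
	return head;
}

/* Detach the whole updated tree below @root and return it as a list. */
static struct node *updated_list(struct node *root)
{
	struct node *head, *child;

	if (!root->updated_next)
		return NULL;	/* subtree is not on the updated tree */
	root->updated_next = NULL;

	head = root;	/* root goes in first, so it ends up last */
	root->flush_next = NULL;
	child = root->updated_children;
	root->updated_children = root;
	if (child != root)
		head = push_children(head, child);
	return head;
}

/* Position of @n in the list starting at @head, or -1 if absent. */
static int list_pos(const struct node *head, const struct node *n)
{
	int i = 0;

	for (; head; head = head->flush_next, i++)
		if (head == n)
			return i;
	return -1;
}
```

Because each node is pushed onto the front of the list and parents are pushed before their next-level children, walking the list from the head visits every child before its parent, which is the ordering the flush loop relies on.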

@ -381,6 +381,12 @@ static bool workqueue_freezing; /* PL: have wqs started freezing? */
/* PL&A: allowable cpus for unbound wqs and work items */ /* PL&A: allowable cpus for unbound wqs and work items */
static cpumask_var_t wq_unbound_cpumask; static cpumask_var_t wq_unbound_cpumask;
/* PL: user requested unbound cpumask via sysfs */
static cpumask_var_t wq_requested_unbound_cpumask;
/* PL: isolated cpumask to be excluded from unbound cpumask */
static cpumask_var_t wq_isolated_cpumask;
/* for further constrain wq_unbound_cpumask by cmdline parameter*/ /* for further constrain wq_unbound_cpumask by cmdline parameter*/
static struct cpumask wq_cmdline_cpumask __initdata; static struct cpumask wq_cmdline_cpumask __initdata;
@ -4408,19 +4414,6 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
mutex_unlock(&ctx->wq->mutex); mutex_unlock(&ctx->wq->mutex);
} }
static void apply_wqattrs_lock(void)
{
/* CPUs should stay stable across pwq creations and installations */
cpus_read_lock();
mutex_lock(&wq_pool_mutex);
}
static void apply_wqattrs_unlock(void)
{
mutex_unlock(&wq_pool_mutex);
cpus_read_unlock();
}
static int apply_workqueue_attrs_locked(struct workqueue_struct *wq, static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs) const struct workqueue_attrs *attrs)
{ {
@ -5825,39 +5818,40 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
} }
/** /**
* workqueue_set_unbound_cpumask - Set the low-level unbound cpumask * workqueue_unbound_exclude_cpumask - Exclude given CPUs from unbound cpumask
* @cpumask: the cpumask to set * @exclude_cpumask: the cpumask to be excluded from wq_unbound_cpumask
* *
* The low-level workqueues cpumask is a global cpumask that limits * This function can be called from cpuset code to provide a set of isolated
* the affinity of all unbound workqueues. This function check the @cpumask * CPUs that should be excluded from wq_unbound_cpumask. The caller must hold
* and apply it to all unbound workqueues and updates all pwqs of them. * either cpus_read_lock or cpus_write_lock.
*
* Return: 0 - Success
* -EINVAL - Invalid @cpumask
* -ENOMEM - Failed to allocate memory for attrs or pwqs.
*/ */
int workqueue_set_unbound_cpumask(cpumask_var_t cpumask) int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
{ {
int ret = -EINVAL; cpumask_var_t cpumask;
int ret = 0;
if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
return -ENOMEM;
lockdep_assert_cpus_held();
mutex_lock(&wq_pool_mutex);
/* Save the current isolated cpumask & export it via sysfs */
cpumask_copy(wq_isolated_cpumask, exclude_cpumask);
/* /*
* Not excluding isolated cpus on purpose. * If the operation fails, it will fall back to
* If the user wishes to include them, we allow that. * wq_requested_unbound_cpumask which is initially set to
* (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) house keeping mask and rewritten
* by any subsequent write to workqueue/cpumask sysfs file.
*/ */
cpumask_and(cpumask, cpumask, cpu_possible_mask); if (!cpumask_andnot(cpumask, wq_requested_unbound_cpumask, exclude_cpumask))
if (!cpumask_empty(cpumask)) { cpumask_copy(cpumask, wq_requested_unbound_cpumask);
apply_wqattrs_lock(); if (!cpumask_equal(cpumask, wq_unbound_cpumask))
if (cpumask_equal(cpumask, wq_unbound_cpumask)) {
ret = 0;
goto out_unlock;
}
ret = workqueue_apply_unbound_cpumask(cpumask); ret = workqueue_apply_unbound_cpumask(cpumask);
out_unlock: mutex_unlock(&wq_pool_mutex);
apply_wqattrs_unlock(); free_cpumask_var(cpumask);
}
return ret; return ret;
} }
@ -5979,6 +5973,19 @@ static struct attribute *wq_sysfs_attrs[] = {
}; };
ATTRIBUTE_GROUPS(wq_sysfs); ATTRIBUTE_GROUPS(wq_sysfs);
static void apply_wqattrs_lock(void)
{
/* CPUs should stay stable across pwq creations and installations */
cpus_read_lock();
mutex_lock(&wq_pool_mutex);
}
static void apply_wqattrs_unlock(void)
{
mutex_unlock(&wq_pool_mutex);
cpus_read_unlock();
}
static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr, static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
char *buf) char *buf)
{ {
@ -6155,19 +6162,74 @@ static struct bus_type wq_subsys = {
.dev_groups = wq_sysfs_groups, .dev_groups = wq_sysfs_groups,
}; };
static ssize_t wq_unbound_cpumask_show(struct device *dev, /**
struct device_attribute *attr, char *buf) * workqueue_set_unbound_cpumask - Set the low-level unbound cpumask
* @cpumask: the cpumask to set
*
* The low-level workqueues cpumask is a global cpumask that limits
* the affinity of all unbound workqueues. This function check the @cpumask
* and apply it to all unbound workqueues and updates all pwqs of them.
*
* Return: 0 - Success
* -EINVAL - Invalid @cpumask
* -ENOMEM - Failed to allocate memory for attrs or pwqs.
*/
static int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
{
int ret = -EINVAL;
/*
* Not excluding isolated cpus on purpose.
* If the user wishes to include them, we allow that.
*/
cpumask_and(cpumask, cpumask, cpu_possible_mask);
if (!cpumask_empty(cpumask)) {
apply_wqattrs_lock();
cpumask_copy(wq_requested_unbound_cpumask, cpumask);
if (cpumask_equal(cpumask, wq_unbound_cpumask)) {
ret = 0;
goto out_unlock;
}
ret = workqueue_apply_unbound_cpumask(cpumask);
out_unlock:
apply_wqattrs_unlock();
}
return ret;
}
static ssize_t __wq_cpumask_show(struct device *dev,
struct device_attribute *attr, char *buf, cpumask_var_t mask)
{ {
int written; int written;
mutex_lock(&wq_pool_mutex); mutex_lock(&wq_pool_mutex);
written = scnprintf(buf, PAGE_SIZE, "%*pb\n", written = scnprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
cpumask_pr_args(wq_unbound_cpumask));
mutex_unlock(&wq_pool_mutex); mutex_unlock(&wq_pool_mutex);
return written; return written;
} }
static ssize_t wq_unbound_cpumask_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
return __wq_cpumask_show(dev, attr, buf, wq_unbound_cpumask);
}
static ssize_t wq_requested_cpumask_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
return __wq_cpumask_show(dev, attr, buf, wq_requested_unbound_cpumask);
}
static ssize_t wq_isolated_cpumask_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
return __wq_cpumask_show(dev, attr, buf, wq_isolated_cpumask);
}
static ssize_t wq_unbound_cpumask_store(struct device *dev, static ssize_t wq_unbound_cpumask_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count) struct device_attribute *attr, const char *buf, size_t count)
{ {
@ -6185,9 +6247,13 @@ static ssize_t wq_unbound_cpumask_store(struct device *dev,
return ret ? ret : count; return ret ? ret : count;
} }
static struct device_attribute wq_sysfs_cpumask_attr = static struct device_attribute wq_sysfs_cpumask_attrs[] = {
__ATTR(cpumask, 0644, wq_unbound_cpumask_show, __ATTR(cpumask, 0644, wq_unbound_cpumask_show,
wq_unbound_cpumask_store); wq_unbound_cpumask_store),
__ATTR(cpumask_requested, 0444, wq_requested_cpumask_show, NULL),
__ATTR(cpumask_isolated, 0444, wq_isolated_cpumask_show, NULL),
__ATTR_NULL,
};
static int __init wq_sysfs_init(void) static int __init wq_sysfs_init(void)
{ {
@ -6200,7 +6266,13 @@ static int __init wq_sysfs_init(void)
dev_root = bus_get_dev_root(&wq_subsys); dev_root = bus_get_dev_root(&wq_subsys);
if (dev_root) { if (dev_root) {
err = device_create_file(dev_root, &wq_sysfs_cpumask_attr); struct device_attribute *attr;
for (attr = wq_sysfs_cpumask_attrs; attr->attr.name; attr++) {
err = device_create_file(dev_root, attr);
if (err)
break;
}
put_device(dev_root); put_device(dev_root);
} }
return err; return err;
@ -6542,12 +6614,17 @@ void __init workqueue_init_early(void)
BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long)); BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL)); BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&wq_requested_unbound_cpumask, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&wq_isolated_cpumask, GFP_KERNEL));
cpumask_copy(wq_unbound_cpumask, cpu_possible_mask); cpumask_copy(wq_unbound_cpumask, cpu_possible_mask);
restrict_unbound_cpumask("HK_TYPE_WQ", housekeeping_cpumask(HK_TYPE_WQ)); restrict_unbound_cpumask("HK_TYPE_WQ", housekeeping_cpumask(HK_TYPE_WQ));
restrict_unbound_cpumask("HK_TYPE_DOMAIN", housekeeping_cpumask(HK_TYPE_DOMAIN)); restrict_unbound_cpumask("HK_TYPE_DOMAIN", housekeeping_cpumask(HK_TYPE_DOMAIN));
if (!cpumask_empty(&wq_cmdline_cpumask)) if (!cpumask_empty(&wq_cmdline_cpumask))
restrict_unbound_cpumask("workqueue.unbound_cpus", &wq_cmdline_cpumask); restrict_unbound_cpumask("workqueue.unbound_cpus", &wq_cmdline_cpumask);
cpumask_copy(wq_requested_unbound_cpumask, wq_unbound_cpumask);
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC); pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
wq_update_pod_attrs_buf = alloc_workqueue_attrs(); wq_update_pod_attrs_buf = alloc_workqueue_attrs();
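The mask arithmetic in workqueue_unbound_exclude_cpumask() above reduces to "requested minus excluded, falling back to requested if nothing is left". A minimal sketch of that rule, assuming CPUs are represented as bits in a 64-bit word instead of a cpumask_var_t (unbound_mask_after_exclude() is a hypothetical helper, not a kernel function):

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the effective unbound mask from the
 * user-requested mask and the set of isolated CPUs to exclude.
 */
static uint64_t unbound_mask_after_exclude(uint64_t requested, uint64_t exclude)
{
	uint64_t mask = requested & ~exclude;

	/* An empty result would leave unbound work nowhere to run: fall back. */
	if (mask == 0)
		mask = requested;
	return mask;
}
```

For example, with requested CPUs 0-3 (0xF) and isolated CPUs 2-3 (0xC), the result is CPUs 0-1 (0x3); if every requested CPU is isolated, the requested mask is used unchanged.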

@ -146,71 +146,6 @@ test_add_proc()
echo $$ > $CGROUP2/cgroup.procs # Move out the task echo $$ > $CGROUP2/cgroup.procs # Move out the task
} }
#
# Testing the new "isolated" partition root type
#
test_isolated()
{
cd $CGROUP2/test
echo 2-3 > cpuset.cpus
TYPE=$(cat cpuset.cpus.partition)
[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
console_msg "Change from member to root"
test_partition root
console_msg "Change from root to isolated"
test_partition isolated
console_msg "Change from isolated to member"
test_partition member
console_msg "Change from member to isolated"
test_partition isolated
console_msg "Change from isolated to root"
test_partition root
console_msg "Change from root to member"
test_partition member
#
# Testing partition root with no cpu
#
console_msg "Distribute all cpus to child partition"
echo +cpuset > cgroup.subtree_control
test_partition root
mkdir A1
cd A1
echo 2-3 > cpuset.cpus
test_partition root
test_effective_cpus 2-3
cd ..
test_effective_cpus ""
console_msg "Moving task to partition test"
test_add_proc "No space left"
cd A1
test_add_proc ""
cd ..
console_msg "Shrink and expand child partition"
cd A1
echo 2 > cpuset.cpus
cd ..
test_effective_cpus 3
cd A1
echo 2-3 > cpuset.cpus
cd ..
test_effective_cpus ""
# Cleaning up
console_msg "Cleaning up"
echo $$ > $CGROUP2/cgroup.procs
[[ -d A1 ]] && rmdir A1
}
# #
# Cpuset controller state transition test matrix. # Cpuset controller state transition test matrix.
# #
@ -297,14 +232,14 @@ TEST_MATRIX=(
" C0-3:S+ C1-3:S+ C2-3 C4-5 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4-5 \ " C0-3:S+ C1-3:S+ C2-3 C4-5 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4-5 \
A1:P0,A2:P1,A3:P2,B1:P1 2-3" A1:P0,A2:P1,A3:P2,B1:P1 2-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4 \ " C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4" A1:P0,A2:P1,A3:P2,B1:P1 2-4,2-3"
" C0-3:S+ C1-3:S+ C3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:2,A3:3,B1:4 \ " C0-3:S+ C1-3:S+ C3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:2,A3:3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4" A1:P0,A2:P1,A3:P2,B1:P1 2-4,3"
" C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X4:P1 . 0 A1:0-1,A2:2-3,A3:4 \ " C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X4:P1 . 0 A1:0-1,A2:2-3,A3:4 \
A1:P0,A2:P2,A3:P1 2-4" A1:P0,A2:P2,A3:P1 2-4,2-3"
" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \ " C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
. . X5 . . 0 A1:0-4,A2:1-4,A3:2-4 \ . . X5 . . 0 A1:0-4,A2:1-4,A3:2-4 \
A1:P0,A2:P-2,A3:P-1 ." A1:P0,A2:P-2,A3:P-1"
" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \ " C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
. . . X1 . 0 A1:0-1,A2:2-4,A3:2-4 \ . . . X1 . 0 A1:0-1,A2:2-4,A3:2-4 \
A1:P0,A2:P2,A3:P-1 2-4" A1:P0,A2:P2,A3:P-1 2-4"
@ -313,7 +248,7 @@ TEST_MATRIX=(
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 . 0 A1:0-1,A2:1,A3:3 A1:P0,A3:P2 2-3" " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 . 0 A1:0-1,A2:1,A3:3 A1:P0,A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1,A2:1,A3:2-3 A1:P0,A3:P2 2-3" " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:O2=0 O2=1 0 A1:0-1,A2:1,A3:2-3 A1:P0,A3:P2 2-3"
" C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2,A2:1-2,A3: A1:P0,A3:P2 3" " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 P2:O3=0 . 0 A1:0-2,A2:1-2,A3: A1:P0,A3:P2 3"
" C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2,A2:1-2,A3:1-2 A1:P0,A3:P-2 3" " C0-3:S+ C1-3:S+ C3 . X2-3 X2-3 T:P2:O3=0 . 0 A1:0-2,A2:1-2,A3:1-2 A1:P0,A3:P-2 3,"
# An invalidated remote partition cannot self-recover from hotplug # An invalidated remote partition cannot self-recover from hotplug
" C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3,A2:1-3,A3:2 A1:P0,A3:P-2" " C0-3:S+ C1-3:S+ C2 . X2-3 X2-3 T:P2:O2=0 O2=1 0 A1:0-3,A2:1-3,A3:2 A1:P0,A3:P-2"
@ -347,10 +282,10 @@ TEST_MATRIX=(
# cpus_allowed/exclusive_cpus update tests # cpus_allowed/exclusive_cpus update tests
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \ " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. C4 . P2 . 0 A1:4,A2:4,XA2:,XA3:,A3:4 \ . C4 . P2 . 0 A1:4,A2:4,XA2:,XA3:,A3:4 \
A1:P0,A3:P-2 ." A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \ " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. X1 . P2 . 0 A1:0-3,A2:1-3,XA1:1,XA2:,XA3:,A3:2-3 \ . X1 . P2 . 0 A1:0-3,A2:1-3,XA1:1,XA2:,XA3:,A3:2-3 \
A1:P0,A3:P-2 ." A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \ " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. . C3 P2 . 0 A1:0-2,A2:0-2,XA2:3,XA3:3,A3:3 \ . . C3 P2 . 0 A1:0-2,A2:0-2,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P2 3" A1:P0,A3:P2 3"
@@ -359,13 +294,13 @@ TEST_MATRIX=(
A1:P0,A3:P2 3" A1:P0,A3:P2 3"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \ " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
. . X3 . . 0 A1:0-3,A2:1-3,XA2:3,XA3:3,A3:2-3 \ . . X3 . . 0 A1:0-3,A2:1-3,XA2:3,XA3:3,A3:2-3 \
A1:P0,A3:P-2 ." A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \ " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
. . C3 . . 0 A1:0-3,A2:3,XA2:3,XA3:3,A3:3 \ . . C3 . . 0 A1:0-3,A2:3,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P-2 ." A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \ " C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
. C4 . . . 0 A1:4,A2:4,A3:4,XA1:,XA2:,XA3 \ . C4 . . . 0 A1:4,A2:4,A3:4,XA1:,XA2:,XA3 \
A1:P0,A3:P-2 ." A1:P0,A3:P-2"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS # old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ -------- # ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
@@ -441,7 +376,7 @@ write_cpu_online()
} }
fi fi
echo $VAL > $CPUFILE echo $VAL > $CPUFILE
pause 0.01 pause 0.05
} }
# #
@@ -573,12 +508,14 @@ dump_states()
XECPUS=$DIR/cpuset.cpus.exclusive.effective XECPUS=$DIR/cpuset.cpus.exclusive.effective
PRS=$DIR/cpuset.cpus.partition PRS=$DIR/cpuset.cpus.partition
PCPUS=$DIR/.__DEBUG__.cpuset.cpus.subpartitions PCPUS=$DIR/.__DEBUG__.cpuset.cpus.subpartitions
ISCPUS=$DIR/cpuset.cpus.isolated
[[ -e $CPUS ]] && echo "$CPUS: $(cat $CPUS)" [[ -e $CPUS ]] && echo "$CPUS: $(cat $CPUS)"
[[ -e $XCPUS ]] && echo "$XCPUS: $(cat $XCPUS)" [[ -e $XCPUS ]] && echo "$XCPUS: $(cat $XCPUS)"
[[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)" [[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
[[ -e $XECPUS ]] && echo "$XECPUS: $(cat $XECPUS)" [[ -e $XECPUS ]] && echo "$XECPUS: $(cat $XECPUS)"
[[ -e $PRS ]] && echo "$PRS: $(cat $PRS)" [[ -e $PRS ]] && echo "$PRS: $(cat $PRS)"
[[ -e $PCPUS ]] && echo "$PCPUS: $(cat $PCPUS)" [[ -e $PCPUS ]] && echo "$PCPUS: $(cat $PCPUS)"
[[ -e $ISCPUS ]] && echo "$ISCPUS: $(cat $ISCPUS)"
done done
} }
@@ -656,11 +593,17 @@ check_cgroup_states()
# #
# Get isolated (including offline) CPUs by looking at # Get isolated (including offline) CPUs by looking at
# /sys/kernel/debug/sched/domains and compare that with the expected value. # /sys/kernel/debug/sched/domains and cpuset.cpus.isolated control file,
# if available, and compare that with the expected value.
# #
# Note that a sched domain of just 1 CPU will be considered isolated. # Note that isolated CPUs from the sched/domains context include offline
# CPUs as well as CPUs in non-isolated 1-CPU partition. Those CPUs may
# not be included in the cpuset.cpus.isolated control file which contains
# only CPUs in isolated partitions.
# #
# $1 - expected isolated cpu list # $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
# <isolcpus1> - expected sched/domains value
# <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
# #
check_isolcpus() check_isolcpus()
{ {
@@ -668,8 +611,38 @@ check_isolcpus()
ISOLCPUS= ISOLCPUS=
LASTISOLCPU= LASTISOLCPU=
SCHED_DOMAINS=/sys/kernel/debug/sched/domains SCHED_DOMAINS=/sys/kernel/debug/sched/domains
ISCPUS=${CGROUP2}/cpuset.cpus.isolated
if [[ $EXPECT_VAL = . ]]
then
EXPECT_VAL=
EXPECT_VAL2=
elif [[ $(expr $EXPECT_VAL : ".*,.*") > 0 ]]
then
set -- $(echo $EXPECT_VAL | sed -e "s/,/ /g")
EXPECT_VAL=$1
EXPECT_VAL2=$2
else
EXPECT_VAL2=$EXPECT_VAL
fi
#
# Check the debug isolated cpumask, if present
#
[[ -f $ISCPUS ]] && {
ISOLCPUS=$(cat $ISCPUS)
[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
# Take a 50ms pause and try again
pause 0.05
ISOLCPUS=$(cat $ISCPUS)
}
[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
ISOLCPUS=
}
#
# Use the sched domain in debugfs to check isolated CPUs, if available
#
[[ -d $SCHED_DOMAINS ]] || return 0 [[ -d $SCHED_DOMAINS ]] || return 0
[[ $EXPECT_VAL = . ]] && EXPECT_VAL=
for ((CPU=0; CPU < $NR_CPUS; CPU++)) for ((CPU=0; CPU < $NR_CPUS; CPU++))
do do
@@ -713,6 +686,26 @@ test_fail()
exit 1 exit 1
} }
#
# Check to see if there are unexpected isolated CPUs left
#
null_isolcpus_check()
{
[[ $VERBOSE -gt 0 ]] || return 0
# Retry a few times before printing error
RETRY=0
while [[ $RETRY -lt 5 ]]
do
pause 0.01
check_isolcpus "."
[[ $? -eq 0 ]] && return 0
((RETRY++))
done
echo "Unexpected isolated CPUs: $ISOLCPUS"
dump_states
exit 1
}
# #
# Run cpuset state transition test # Run cpuset state transition test
# $1 - test matrix name # $1 - test matrix name
@@ -787,7 +780,7 @@ run_state_test()
# #
NEWLIST=$(cat cpuset.cpus.effective) NEWLIST=$(cat cpuset.cpus.effective)
RETRY=0 RETRY=0
while [[ $NEWLIST != $CPULIST && $RETRY -lt 5 ]] while [[ $NEWLIST != $CPULIST && $RETRY -lt 8 ]]
do do
# Wait a bit longer & recheck a few times # Wait a bit longer & recheck a few times
pause 0.01 pause 0.01
@@ -798,12 +791,79 @@ run_state_test()
echo "Effective cpus changed to $NEWLIST after test $I!" echo "Effective cpus changed to $NEWLIST after test $I!"
exit 1 exit 1
} }
null_isolcpus_check
[[ $VERBOSE -gt 0 ]] && echo "Test $I done." [[ $VERBOSE -gt 0 ]] && echo "Test $I done."
((I++)) ((I++))
done done
echo "All $I tests of $TEST PASSED." echo "All $I tests of $TEST PASSED."
} }
#
# Testing the new "isolated" partition root type
#
test_isolated()
{
cd $CGROUP2/test
echo 2-3 > cpuset.cpus
TYPE=$(cat cpuset.cpus.partition)
[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
console_msg "Change from member to root"
test_partition root
console_msg "Change from root to isolated"
test_partition isolated
console_msg "Change from isolated to member"
test_partition member
console_msg "Change from member to isolated"
test_partition isolated
console_msg "Change from isolated to root"
test_partition root
console_msg "Change from root to member"
test_partition member
#
# Testing partition root with no cpu
#
console_msg "Distribute all cpus to child partition"
echo +cpuset > cgroup.subtree_control
test_partition root
mkdir A1
cd A1
echo 2-3 > cpuset.cpus
test_partition root
test_effective_cpus 2-3
cd ..
test_effective_cpus ""
console_msg "Moving task to partition test"
test_add_proc "No space left"
cd A1
test_add_proc ""
cd ..
console_msg "Shrink and expand child partition"
cd A1
echo 2 > cpuset.cpus
cd ..
test_effective_cpus 3
cd A1
echo 2-3 > cpuset.cpus
cd ..
test_effective_cpus ""
# Cleaning up
console_msg "Cleaning up"
echo $$ > $CGROUP2/cgroup.procs
[[ -d A1 ]] && rmdir A1
null_isolcpus_check
}
# #
# Wait for inotify event for the given file and read it # Wait for inotify event for the given file and read it
# $1: cgroup file to wait for # $1: cgroup file to wait for
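
Note on the ISOLCPUS column: the check_isolcpus() change above accepts an expected value of the form <isolcpus1>{,<isolcpus2>}, where the first value is compared against sched/domains and the second against cpuset.cpus.isolated. The following is a minimal standalone sketch of that split, not the code from the patch. The helper name split_isolcpus is hypothetical, and it splits at the first comma using parameter expansion rather than the set/sed approach in the script, so treat it as an illustration of the field format only.

```shell
#!/bin/bash
# Hypothetical helper (not part of test_cpuset_prs.sh): split an expected
# ISOLCPUS field "<isolcpus1>{,<isolcpus2>}" into its two components.
#   "."      -> both values empty (no isolated CPUs expected)
#   "a,b"    -> first value a, second value b (b may be empty, e.g. "3,")
#   "a"      -> second value defaults to the first
split_isolcpus()
{
	local field=$1 val1 val2

	case "$field" in
	.)	val1=; val2= ;;
	*,*)	val1=${field%%,*}	# text before the first comma
		val2=${field#*,}	# text after the first comma
		;;
	*)	val1=$field; val2=$field ;;
	esac
	echo "sched/domains='$val1' cpuset.cpus.isolated='$val2'"
}

split_isolcpus "2-3"	# same value expected from both sources
split_isolcpus "3,"	# CPU 3 in sched/domains only
split_isolcpus "."	# nothing isolated
```

Under these assumptions, a matrix entry ending in "3," would mean that CPU 3 shows up as isolated in sched/domains while cpuset.cpus.isolated is expected to be empty, which matches the comment that offline CPUs and non-isolated 1-CPU partitions may appear only in the sched/domains view.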


@@ -740,7 +740,7 @@ static int test_cgfreezer_ptraced(const char *root)
/* /*
* cg_check_frozen(cgroup, true) will fail here, * cg_check_frozen(cgroup, true) will fail here,
* because the task in in the TRACEd state. * because the task is in the TRACEd state.
*/ */
if (cg_freeze_wait(cgroup, false)) if (cg_freeze_wait(cgroup, false))
goto cleanup; goto cleanup;