Commit Graph

486 Commits

Author SHA1 Message Date
Tejun Heo
026e14a276 Merge branch 'for-6.0-fixes' into for-6.1
for-6.0 has the following fix for cgroup_get_from_id().

  836ac87d ("cgroup: fix cgroup_get_from_id")

which conflicts with the following two commits in for-6.1.

  4534dee9 ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id")
  fa7e439c ("cgroup: Homogenize cgroup_get_from_id() return value")

While the resolution is straightforward, the code ends up pretty ugly
afterwards. Let's pull for-6.0-fixes into for-6.1 so that the code can be
fixed up there.

Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-23 07:19:38 -10:00
Ming Lei
df02452f3d cgroup: cgroup_get_from_id() must check the looked-up kn is a directory
cgroup has to be one kernfs dir, otherwise kernel panic is caused,
especially cgroup id is provide from userspace.

Reported-by: Marco Patalano <mpatalan@redhat.com>
Fixes: 6b658c4863 ("scsi: cgroup: Add cgroup_get_from_id()")
Cc: Muneendra <muneendra.kumar@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-23 07:18:45 -10:00
Jiapeng Chong
c478bd8836 cgroup/cpuset: remove unreachable code
The function sched_partition_show cannot execute seq_puts, delete the
invalid code.

kernel/cgroup/cpuset.c:2849 sched_partition_show() warn: ignoring unreachable code.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2087
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-07 05:30:29 -10:00
Tejun Heo
8a693f7766 cgroup: Remove CFTYPE_PRESSURE
CFTYPE_PRESSURE is used to flag PSI related files so that they are not
created if PSI is disabled during boot. It's a bit weird to use a generic
flag to mark a specific file type. Let's instead move the PSI files into its
own cftypes array and add/rm them conditionally. This is a bit more code but
cleaner.

No userland visible changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2022-09-06 09:38:55 -10:00
Tejun Heo
0083d27b21 cgroup: Improve cftype add/rm error handling
Let's track whether a cftype is currently added or not using a new flag
__CFTYPE_ADDED so that duplicate operations can be failed safely and
consistently allow using empty cftypes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-06 09:38:42 -10:00
Waiman Long
d7c8142d5a cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
Currently, changes in "cpust.cpus" of a partition root is not allowed if
it violates the sibling cpu exclusivity rule when the check is done
in the validate_change() function. That is inconsistent with the
other cpuset changes that are always allowed but may make a partition
invalid.

Update the cpuset code to allow cpumask change even if it violates the
sibling cpu exclusivity rule, but invalidate the partition instead
just like the other changes. However, other sibling partitions with
conflicting cpumask will also be invalidated in order to not violating
the exclusivity rule. This behavior is specific to this partition
rule violation.

Note that a previous commit has made sibling cpu exclusivity rule check
the last check of validate_change(). So if -EINVAL is returned, we can
be sure that sibling cpu exclusivity rule violation is the only rule
that is broken.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:28 -10:00
Waiman Long
74027a6535 cgroup/cpuset: Relocate a code block in validate_change()
This patch moves down the exclusive cpu and memory check in
validate_change(). There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:28 -10:00
Waiman Long
7476a636d3 cgroup/cpuset: Show invalid partition reason string
There are a number of different reasons which can cause a partition to
become invalid. A user seeing an invalid partition may not know exactly
why. To help user to get a better understanding of the underlying reason,
The cpuset.cpus.partition control file, when read, will now report the
reason why a partition become invalid. When a partition does become
invalid, reading the control file will show "root invalid (<reason>)"
where <reason> is a string that describes why the partition is invalid.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long
f28e22441f cgroup/cpuset: Add a new isolated cpus.partition type
Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.

For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.

When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.

To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of isolated partition to the current need
of the workload without rebooting the system.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long
f0af1bfc27 cgroup/cpuset: Relax constraints to partition & cpus changes
Currently, enabling a partition root is only allowed if all the
constraints of a valid partition are satisfied. Even changes to
"cpuset.cpus" may not be allowed in some cases. Moreover, there are
limits to changes made to a parent cpuset if it is a valid partition
root. This is contrary to the general cgroup v2 philosophy.

This patch relaxes the constraints of changing the state of "cpuset.cpus"
and "cpuset.cpus.partition". Now all valid changes ("member" or "root")
to "cpuset.cpus.partition" are allowed even if there are child cpusets
underneath it.

Trying to make a cpuset a partition root, however, will cause its state
to become invalid if the following constraints of a valid partition
root are not satisfied.

 1) The "cpuset.cpus" is non-empty and exclusive.
 2) The parent cpuset is a valid partition root.
 3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".

Similarly, almost all changes to "cpuset.cpus" are allowed with the
exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the
exclusivity rule will still apply.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long
e2d59900d9 cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its
CPUs to child partitions with no CPUs left. However in most cases,
there shouldn't be any tasks associated with intermediate nodes of the
default hierarchy. So the current rule is too restrictive and can waste
valuable CPU resource.

To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Since cpuset is
threaded, no-internal-process rule does not apply. So it is possible
to have tasks in a partition root with child sub-partitions even though
that should be uncommon.

A parent partition with no task can now have all its CPUs distributed out
to its child partitions. The top cpuset always have some house-keeping
tasks running and so its list of effective cpu can't be empty.

Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long
18065ebe9b cgroup/cpuset: Miscellaneous cleanups & add helper functions
The partition root state (PRS) macro names do not currently match the
external names. Change them to match the external names and add helper
functions to read or change the state.

Shorten the cpuset argument of update_parent_subparts_cpumask() to cs
to match other cpuset functions.

Remove the new_prs argument from notify_partition_change() as the
cs->partition_root_state has already been set to new_prs before it
is called.

There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long
ec5fbdfb99 cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
Previously, update_tasks_cpumask() is not supposed to be called with
top cpuset. With cpuset partition that takes CPUs away from the top
cpuset, adjusting the cpus_mask of the tasks in the top cpuset is
necessary. Percpu kthreads, however, are ignored.

Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Josh Don
5251c6c436 cgroup: add pids.peak interface for pids controller
pids.peak tracks the high watermark of usage for number of pids. This
helps give a better baseline on which to set pids.max. Polling
pids.current isn't really feasible, since it would potentially miss
short-lived spikes.

This interface is analogous to memory.peak.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 09:26:51 -10:00
Tejun Heo
dc79ec1b23 cgroup: Remove data-race around cgrp_dfl_visible
There's a seemingly harmless data-race around cgrp_dfl_visible detected by
kernel concurrency sanitizer. Let's remove it by throwing WRITE/READ_ONCE at
it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Abhishek Shah <abhishek.shah@columbia.edu>
Cc: Gabriel Ryan <gabe@cs.columbia.edu>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/netdev/20220819072256.fn7ctciefy4fc4cu@wittgenstein/
2022-09-04 09:16:19 -10:00
Tetsuo Handa
075b593f54 cgroup: Use cgroup_attach_{lock,unlock}() from cgroup_attach_task_all()
No behavior changes; preparing for potential locking changes in future.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by:Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-26 11:14:34 -10:00
Tejun Heo
265efc941f Merge branch 'for-6.0-fixes' into for-6.1
Pulling to receive 43626dade3 ("group: Add missing cpus_read_lock() to
cgroup_attach_task_all()") for a follow-up patch.
2022-08-26 11:13:39 -10:00
Michal Koutný
fa7e439cf9 cgroup: Homogenize cgroup_get_from_id() return value
Cgroup id is user provided datum hence extend its return domain to
include possible error reason (similar to cgroup_get_from_fd()).

This change also fixes commit d4ccaf58a8 ("bpf: Introduce cgroup
iter") that would use NULL instead of proper error handling in
d4ccaf58a8 ("bpf: Introduce cgroup iter").

Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup,
mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without
CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive
CONFIG_MEMCG respectively) transitive, so drop the singular definition
not needed with !CONFIG_CGROUPS.

Fixes: d4ccaf58a8 ("bpf: Introduce cgroup iter")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-26 10:57:41 -10:00
Michal Koutný
4534dee941 cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id
Cgroup ids are resolved in the global scope. That may be needed sometime
(in future) but currently it violates virtual view provided through
cgroup namespaces.

There are currently following users of the resolution:
- fc_appid_store
- bpf_iter_attach_cgroup
- mem_cgroup_get_from_ino

None of the is a called on behalf of kernel but the resolution is made
with proper userspace context, hence the default to current->nsproxy
makes sens. (This doesn't rule out cgroup_get_from_id with cgroup NS
parameter in the future.)

Since cgroup ids are defined on v2 hierarchy only, we simply check
existence in the cgroup namespace by looking at ancestry on the default
hierarchy.

Fixes: 6b658c4863 ("scsi: cgroup: Add cgroup_get_from_id()")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-26 10:57:10 -10:00
Michal Koutný
74e4b956eb cgroup: Honor caller's cgroup NS when resolving path
cgroup_get_from_path() is not widely used function. Its callers presume
the path is resolved under cgroup namespace. (There is one caller
currently and resolving in init NS won't make harm (netfilter). However,
future users may be subject to different effects when resolving
globally.)
Since, there's currently no use for the global resolution, modify the
existing function to take cgroup NS into account.

Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-26 10:56:26 -10:00
Tetsuo Handa
43626dade3 cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at
cpuset_attach() [1], for commit 4f7e723643 ("cgroup: Fix
threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that
cpuset_attach() is also called from cgroup_attach_task_all().
Add cpus_read_lock() like what cgroup_procs_write_start() does.

Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1]
Reported-by: syzbot <syzbot+29d3a3b4d86c8136ad9e@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 4f7e723643 ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock")
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-25 07:36:30 -10:00
Jing-Ting Wu
763f4fb76e cgroup: Fix race condition at rebind_subsystems()
Root cause:
The rebind_subsystems() is no lock held when move css object from A
list to B list,then let B's head be treated as css node at
list_for_each_entry_rcu().

Solution:
Add grace period before invalidating the removed rstat_css_node.

Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Fixes: a7df69b81a ("cgroup: rstat: support cgroup1")
Cc: stable@vger.kernel.org # v5.13+
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-23 08:11:06 -10:00
Tejun Heo
4f7e723643 cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock
Bringing up a CPU may involve creating and destroying tasks which requires
read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
cpus_read_lock(). However, cpuset's ->attach(), which may be called with
thredagroup_rwsem write-locked, also wants to disable CPU hotplug and
acquires cpus_read_lock(), leading to a deadlock.

Fix it by guaranteeing that ->attach() is always called with CPU hotplug
disabled and removing cpus_read_lock() call from cpuset_attach().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Imran Khan <imran.f.khan@oracle.com>
Reported-and-tested-by: Xuewen Yan <xuewen.yan@unisoc.com>
Fixes: 05c7b7a92c ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: stable@vger.kernel.org # v5.17+
2022-08-17 07:36:05 -10:00
Hao Jia
76b079ef4c sched/psi: Remove unused parameter nbytes of psi_trigger_create()
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-15 12:35:25 -10:00
Tejun Heo
7f203bc89e cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's
no advantage to remembering the IDs instead of the pointers directly and
this makes the array useless for finding an actual ancestor cgroup forcing
cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's
replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up
from cgroup_ancestor().

While at it, improve comments around cgroup_root->cgrp_ancestor_storage.

This patch shouldn't cause user-visible behavior differences.

v2: Update cgroup_ancestor() to use ->ancestors[].

v3: cgroup_root->cgrp_ancestor_storage's type is updated to match
    cgroup->ancestors[]. Better comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
2022-08-15 11:16:47 -10:00
Linus Torvalds
cac03ac368 Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuvmwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gONQ/+KkkPTeKgGDvrahTfeYZlmRyvcI1R78r9
 yooa8v+DtifznBW2eXDBc8WTruzqr78VyUY+1YSjfKS6FRQWYMficJ3qk3hxgBru
 998KZbvl3jXBBlRkqgGeFlF5Ty2KaryEZgX97a7IF/0xWDgpm972jFkJ/KCo/YTY
 WSQrzutz2FKe71EjK4cAplYxPZIiy/zo2hSGTbsso4M7bO5VLc1Y4qMtFGcCZ7JB
 s9JYkj2Rfz+AS5wioDRcGuec4A4SrroxKszZA6QDDBuhMJukqexO02xs/fxZ2W4Z
 DF4U5MFOrtz9AWSGsf1P6XXbgJO8qTgQXZchFsEcJwypV13w8U0IViXQfD/Pvx2X
 y+WHdnZVIO2sDwOJ15ew7IuoJZ2LsVygrBNFJJaIFOtIz3RzprI0BJN7LeWFALOa
 IPmbtiY8hVwhKmjRgMHWDwJhMEHLuhGx3idiD89w1pknzTUnKDiwLyEUtyynxeGd
 ft9uCvPefrYQVx9AiH7wf0W+fg334FCccC+0f8LyduyftUyQCfZIZY6LUSKuKded
 Odm7k0ngLDPbdZwAHs0Nf/ilRwd91Z7b6hGt5U3ptx+8BPMKB+/k1VoKog7OISPc
 zGaP7DrtuC4sEdX4X6bqX+mEQhpkLcQw15gVGxhKoHqygWNSZrV634aSSXwfVXJx
 eT5m/K9a7L0=
 =CYl5
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix
  and a comment fix"

* tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Do not requeue task on CPU excluded from cpus_mask
  sched/rt: Fix Sparse warnings due to undefined rt.c declarations
  exit: Fix typo in comment: s/sub-theads/sub-threads
  sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
2022-08-06 17:34:06 -07:00
Linus Torvalds
b6bb70f9ab Several core optimizations:
* threadgroup_rwsem write locking is skipped when configuring controllers in
   empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common
   static usage pattern to not grab threadgroup_rwsem at all (glibc still
   doesn't seem ready for CLONE_INTO_CGROUP unfortunately).
 
 * threadgroup_rwsem used to be put into non-percpu mode by default due to
   latency concerns in specific use cases. There's no reason for everyone
   else to pay for it. Make the behavior optional.
 
 * psi no longer allocates memory when disabled.
 
 along with some code cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYugHIQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGd+oAP9lfD3fTRdNo4qWV2VsZsYzoOxzNIuJSwN/dnYx
 IEbQOwD/cd2YMfeo6zcb427U/VfTFqjJjFK04OeljYtJU8fFywo=
 =sucy
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "Several core optimizations:

   - threadgroup_rwsem write locking is skipped when configuring
     controllers in empty subtrees.

     Combined with CLONE_INTO_CGROUP, this allows the common static
     usage pattern to not grab threadgroup_rwsem at all (glibc still
     doesn't seem ready for CLONE_INTO_CGROUP unfortunately).

   - threadgroup_rwsem used to be put into non-percpu mode by default
     due to latency concerns in specific use cases. There's no reason
     for everyone else to pay for it. Make the behavior optional.

   - psi no longer allocates memory when disabled.

  ... along with some code cleanups"

* tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Skip subtree root in cgroup_update_dfl_csses()
  cgroup: remove "no" prefixed mount options
  cgroup: Make !percpu threadgroup_rwsem operations optional
  cgroup: Add "no" prefixed mount options
  cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
  cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
  cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
  psi: dont alloc memory for psi by default
2022-08-03 09:45:08 -07:00
Waiman Long
b6e8d40d43 sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
to dl_bw_of() to crash due to percpu value access of an out of bound
CPU value. For example:

	[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
	  :
	[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
	  :
	[80468.207946] Call Trace:
	[80468.208947]  cpuset_can_attach+0xa0/0x140
	[80468.209953]  cgroup_migrate_execute+0x8c/0x490
	[80468.210931]  cgroup_update_dfl_csses+0x254/0x270
	[80468.211898]  cgroup_subtree_control_write+0x322/0x400
	[80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
	[80468.213777]  new_sync_write+0x11f/0x1b0
	[80468.214689]  vfs_write+0x1eb/0x280
	[80468.215592]  ksys_write+0x5f/0xe0
	[80468.216463]  do_syscall_64+0x5c/0x80
	[80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.

Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.

Fixes: 7f51412a41 ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
2022-08-03 10:34:26 +02:00
Linus Torvalds
b167fdffe9 This cycle's scheduler updates for v6.0 are:
Load-balancing improvements:
 ============================
 
 - Improve NUMA balancing on AMD Zen systems for affine workloads.
 
 - Improve the handling of reduced-capacity CPUs in load-balancing.
 
 - Energy Model improvements: fix & refine all the energy fairness metrics (PELT),
   and remove the conservative threshold requiring 6% energy savings to
   migrate a task. Doing this improves power efficiency for most workloads,
   and also increases the reliability of energy-efficiency scheduling.
 
 - Optimize/tweak select_idle_cpu() to spend (much) less time searching
   for an idle CPU on overloaded systems. There's reports of several
   milliseconds spent there on large systems with large workloads ...
 
   [ Since the search logic changed, there might be behavioral side effects. ]
 
 - Improve NUMA imbalance behavior. On certain systems
   with spare capacity, initial placement of tasks is non-deterministic,
   and such an artificial placement imbalance can persist for a long time,
   hurting (and sometimes helping) performance.
 
   The fix is to make fork-time task placement consistent with runtime
   NUMA balancing placement.
 
   Note that some performance regressions were reported against this,
   caused by workloads that are not memory bandwith limited, which benefit
   from the artificial locality of the placement bug(s). Mel Gorman's
   conclusion, with which we concur, was that consistency is better than
   random workload benefits from non-deterministic bugs:
 
      "Given there is no crystal ball and it's a tradeoff, I think it's
       better to be consistent and use similar logic at both fork time
       and runtime even if it doesn't have universal benefit."
 
 - Improve core scheduling by fixing a bug in sched_core_update_cookie() that
   caused unnecessary forced idling.
 
 - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly
   woken tasks.
 
 - Fix a newidle balancing bug that introduced unnecessary wakeup latencies.
 
 ABI improvements/fixes:
 =======================
 
 - Do not check capabilities and do not issue capability check denial messages
   when a scheduler syscall doesn't require privileges. (Such as increasing niceness.)
 
 - Add forced-idle accounting to cgroups too.
 
 - Fix/improve the RSEQ ABI to not just silently accept unknown flags.
   (No existing tooling is known to have learned to rely on the previous behavior.)
 
 - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
 
 Optimizations:
 ==============
 
 - Optimize & simplify leaf_cfs_rq_list()
 
 - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
 
 Misc fixes & cleanups:
 ======================
 
 - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
 
 - Fix a full-NOHZ bug that can in some cases result in the tick not being
   re-enabled when the last SCHED_RT task is gone from a runqueue but there's
   still SCHED_OTHER tasks around.
 
 - Various PREEMPT_RT related fixes.
 
 - Misc cleanups & smaller fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn2ywRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iNfxAAhPJMwM4tYCpIM6PhmxKiHl6kkiT2tt42
 HhEmiJVLjczLybWaWwmGA2dSFkv1f4+hG7nqdZTm9QYn0Pqat2UTSRcwoKQc+gpB
 x85Hwt2IUmnUman52fRl5r1miH9LTdCI6agWaFLQae5ds1XmOugFo52t2ahax+Gn
 dB8LxS2fa/GrKj229EhkJSPWAK4Y94asoTProwpKLuKEeXhDkqUNrOWbKhz+wEnA
 pVZySpA9uEOdNLVSr1s0VB6mZoh5/z6yQefj5YSNntsG71XWo9jxKCIm5buVdk2U
 wjdn6UzoTThOy/5Ygm64eYRexMHG71UamF1JYUdmvDeUJZ5fhG6RD0FECUQNVcJB
 Msu2fce6u1AV0giZGYtiooLGSawB/+e6MoDkjTl8guFHi/peve9CezKX1ZgDWPfE
 eGn+EbYkUS9RMafXCKuEUBAC1UUqAavGN9sGGN1ufyR4za6ogZplOqAFKtTRTGnT
 /Ne3fHTtvv73DLGW9ohO5vSS2Rp7zhAhB6FunhibhxCWlt7W6hA4Ze2vU9hf78Yn
 SJDLAJjOEilLaKUkRG/d9uM3FjKJM1tqxuT76+sUbM0MNxdyiKcviQlP1b8oq5Um
 xE1KNZUevnr/WXqOTGDKHH/HNPFgwxbwavMiP7dNFn8h/hEk4t9dkf5siDmVHtn4
 nzDVOob1LgE=
 =xr2b
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"Load-balancing improvements:

   - Improve NUMA balancing on AMD Zen systems for affine workloads.

   - Improve the handling of reduced-capacity CPUs in load-balancing.

   - Energy Model improvements: fix & refine all the energy fairness
     metrics (PELT), and remove the conservative threshold requiring 6%
     energy savings to migrate a task. Doing this improves power
     efficiency for most workloads, and also increases the reliability
     of energy-efficiency scheduling.

   - Optimize/tweak select_idle_cpu() to spend (much) less time
     searching for an idle CPU on overloaded systems. There's reports of
     several milliseconds spent there on large systems with large
     workloads ...

     [ Since the search logic changed, there might be behavioral side
       effects. ]

   - Improve NUMA imbalance behavior. On certain systems with spare
     capacity, initial placement of tasks is non-deterministic, and such
     an artificial placement imbalance can persist for a long time,
     hurting (and sometimes helping) performance.

     The fix is to make fork-time task placement consistent with runtime
     NUMA balancing placement.

     Note that some performance regressions were reported against this,
     caused by workloads that are not memory bandwith limited, which
     benefit from the artificial locality of the placement bug(s). Mel
     Gorman's conclusion, with which we concur, was that consistency is
     better than random workload benefits from non-deterministic bugs:

        "Given there is no crystal ball and it's a tradeoff, I think
         it's better to be consistent and use similar logic at both fork
         time and runtime even if it doesn't have universal benefit."

   - Improve core scheduling by fixing a bug in
     sched_core_update_cookie() that caused unnecessary forced idling.

   - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
     for newly woken tasks.

   - Fix a newidle balancing bug that introduced unnecessary wakeup
     latencies.

  ABI improvements/fixes:

   - Do not check capabilities and do not issue capability check denial
     messages when a scheduler syscall doesn't require privileges. (Such
     as increasing niceness.)

   - Add forced-idle accounting to cgroups too.

   - Fix/improve the RSEQ ABI to not just silently accept unknown flags.
     (No existing tooling is known to have learned to rely on the
     previous behavior.)

   - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.

  Optimizations:

   - Optimize & simplify leaf_cfs_rq_list()

   - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().

  Misc fixes & cleanups:

   - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.

   - Fix a full-NOHZ bug that can in some cases result in the tick not
     being re-enabled when the last SCHED_RT task is gone from a
     runqueue but there's still SCHED_OTHER tasks around.

   - Various PREEMPT_RT related fixes.

   - Misc cleanups & smaller fixes"

* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rseq: Kill process when unknown flags are encountered in ABI structures
  rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
  sched/core: Fix the bug that task won't enqueue into core tree when update cookie
  nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
  sched/core: Always flush pending blk_plug
  sched/fair: fix case with reduced capacity CPU
  sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
  sched/core: add forced idle accounting for cgroups
  sched/fair: Remove the energy margin in feec()
  sched/fair: Remove task_util from effective utilization in feec()
  sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
  sched/fair: Rename select_idle_mask to select_rq_mask
  sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
  sched/fair: Decay task PELT values during wakeup migration
  sched/fair: Provide u64 read for 32-bits arch helper
  sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  sched: only perform capability check on privileged operation
  sched: Remove unused function group_first_cpu()
  sched/fair: Remove redundant word " *"
  selftests/rseq: check if libc rseq support is registered
  ...
2022-08-01 11:49:06 -07:00
Waiman Long
265792d0de cgroup: Skip subtree root in cgroup_update_dfl_csses()
The cgroup_update_dfl_csses() function updates css associations when a
cgroup's subtree_control file is modified. Any changes made to a cgroup's
subtree_control file, however, will only affect its descendants but not
the cgroup itself. So there is no point in migrating csses associated
with that cgroup. We can skip them instead.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-07-28 07:26:30 -10:00
Tejun Heo
c808f46323 cgroup: remove "no" prefixed mount options
30312730bd ("cgroup: Add "no" prefixed mount options") added "no" prefixed
mount options to allow turning them off and 6a010a49b6 ("cgroup: Make
!percpu threadgroup_rwsem operations optional") added one more "no" prefixed
mount option. However, Michal pointed out that the "no" prefixed options
aren't necessary in allowing mount options to be turned off:

  # grep group /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
  # mount -o remount,nsdelegate,memory_recursiveprot none /sys/fs/cgroup
  # grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0

Note that this is different from the remount behavior when the mount(1) is
invoked without the device argument - "none":

 # grep cgroup /proc/mounts
 cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
 # mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup
 # grep cgroup /proc/mounts
 cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0

While a bit confusing, given that there is a way to turn off the options,
there's no reason to have the explicit "no" prefixed options. Let's remove
them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-07-27 07:54:55 -10:00
Tejun Heo
6a010a49b6 cgroup: Make !percpu threadgroup_rwsem operations optional
3942a9bd7b ("locking, rcu, cgroup: Avoid synchronize_sched() in
__cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
because the impiled synchronize_rcu() on write locking was pushing up the
latencies too much for android which constantly moves processes between
cgroups.

This makes the hotter paths - fork and exit - slower as they're always
forced into the slow path. There is no reason to force this on everyone
especially given that more common static usage pattern can now completely
avoid write-locking the rwsem. Write-locking is elided when turning on and
off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
cgroup without grabbing the rwsem.

Restore the default percpu operations and introduce the mount option
"favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
lower latencies for the dynamic operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutn� <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
2022-07-23 04:29:02 -10:00
Tejun Heo
30312730bd cgroup: Add "no" prefixed mount options
We allow modifying these mount options via remount. Let's add "no" prefixed
variants so that they can be turned off too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
2022-07-22 19:12:52 -10:00
Tejun Heo
671c11f061 cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
cgroup_update_dfl_csses() write-lock the threadgroup_rwsem as updating the
csses can trigger process migrations. However, if the subtree doesn't
contain any tasks, there aren't gonna be any cgroup migrations. This
condition can be trivially detected by testing whether
mgctx.preloaded_src_csets is empty. Elide write-locking threadgroup_rwsem if
the subtree is empty.

After this optimization, the usage pattern of creating a cgroup, enabling
the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and
then removing the cgroup after it becomes empty doesn't need to write-lock
threadgroup_rwsem at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
2022-07-22 19:12:37 -10:00
Josh Don
1fcf54deb7 sched/core: add forced idle accounting for cgroups
4feee7d126 previously added per-task forced idle accounting. This patch
extends this to also include cgroups.

rstat is used for cgroup accounting, except for the root, which uses
kcpustat in order to bypass the need for doing an rstat flush when
reading root stats.

Only cgroup v2 is supported. Similar to the task accounting, the cgroup
accounting requires that schedstats is enabled.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
2022-07-04 09:23:07 +02:00
Lin Feng
d75cd55ae2 cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
We have:
int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp)
{
...
	/* mixables don't care */
	if (cgroup_is_mixable(dst_cgrp))
		return 0;

	/*
	 * If @dst_cgrp is already or can become a thread root or is
	 * threaded, it doesn't matter.
	 */
	if (cgroup_can_be_thread_root(dst_cgrp) || cgroup_is_threaded(dst_cgrp))
		return 0;
...
}

but in fact the entry of cgroup_can_be_thread_root() covers case that
checking cgroup_is_mixable() as following:
static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
{
        /* mixables don't care */
        if (cgroup_is_mixable(cgrp))
                return true;
...
}

so explicitly checking in cgroup_migrate_vet_dst is unnecessary.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-27 18:09:21 +09:00
Tejun Heo
07fd5b6cdf cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. and then it notices A wants to migrate to A as it's an identity
    migration, it culls it by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9851 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-16 09:37:02 -10:00
Lin Feng
e210a89f5b cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
No funtionality change, but save us some lines.

Signed-off-by: Lin Feng <linf@wangsu.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-16 09:33:53 -10:00
Chen Wandun
5f69a6577b psi: dont alloc memory for psi by default
Memory about struct psi_group is allocated by default for
each cgroup even if psi_disabled is true, in this case, these
allocated memory is waste, so alloc memory for struct psi_group
only when psi_disabled is false.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-06-07 07:11:47 -10:00
Linus Torvalds
8b49c4b1b6 Merge branch 'for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing too interesting. This adds cpu controller selftests and there
  are a couple code cleanup patches"

* 'for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: remove the superfluous judgment
  cgroup: Make cgroup_debug static
  kseltest/cgroup: Make test_stress.sh work if run interactively
  kselftest/cgroup: fix test_stress.sh to use OUTPUT dir
  cgroup: Add config file to cgroup selftest suite
  cgroup: Add test_cpucg_max_nested() testcase
  cgroup: Add test_cpucg_max() testcase
  cgroup: Add test_cpucg_nested_weight_underprovisioned() testcase
  cgroup: Adding test_cpucg_nested_weight_overprovisioned() testcase
  cgroup: Add test_cpucg_weight_underprovisioned() testcase
  cgroup: Add test_cpucg_weight_overprovisioned() testcase
  cgroup: Add test_cpucg_stats() testcase to cgroup cpu selftests
  cgroup: Add new test_cpu.c test suite in cgroup selftests
2022-05-25 11:47:25 -07:00
Shida Zhang
b154a017c9 cgroup: remove the superfluous judgment
Remove the superfluous judgment since the function is
never called for a root cgroup, as suggested by Tejun.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-05-19 21:49:45 -10:00
Xiu Jianfeng
29ed17389c cgroup: Make cgroup_debug static
Make cgroup_debug static since it's only used in cgroup.c

Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-05-18 06:59:20 -10:00
Waiman Long
2685027fca cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
There are 3 places where the cpu and node masks of the top cpuset can
be initialized in the order they are executed:
 1) start_kernel -> cpuset_init()
 2) start_kernel -> cgroup_init() -> cpuset_bind()
 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()

The first cpuset_init() call just sets all the bits in the masks.
The second cpuset_bind() call sets cpus_allowed and mems_allowed to the
default v2 values. The third cpuset_init_smp() call sets them back to
v1 values.

For systems with cgroup v2 setup, cpuset_bind() is called once.  As a
result, cpu and memory node hot add may fail to update the cpu and node
masks of the top cpuset to include the newly added cpu or node in a
cgroup v2 environment.

For systems with cgroup v1 setup, cpuset_bind() is called again by
rebind_subsystem() when the v1 cpuset filesystem is mounted as shown
in the dmesg log below with an instrumented kernel.

  [    2.609781] cpuset_bind() called - v2 = 1
  [    3.079473] cpuset_init_smp() called
  [    7.103710] cpuset_bind() called - v2 = 0

smp_init() is called after the first two init functions.  So we don't
have a complete list of active cpus and memory nodes until later in
cpuset_init_smp() which is the right time to set up effective_cpus
and effective_mems.

To fix this cgroup v2 mask setup problem, the potentially incorrect
cpus_allowed & mems_allowed setting in cpuset_init_smp() are removed.
For cgroup v2 systems, the initial cpuset_bind() call will set the masks
correctly.  For cgroup v1 systems, the second call to cpuset_bind()
will do the right setup.

cc: stable@vger.kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-05-05 08:57:00 -10:00
Linus Torvalds
266d17a8c0 Driver core changes for 5.18-rc1
Here is the set of driver core changes for 5.18-rc1.
 
 Not much here, primarily it was a bunch of cleanups and small updates:
 	- kobj_type cleanups for default_groups
 	- documentation updates
 	- firmware loader minor changes
 	- component common helper added and take advantage of it in many
 	  drivers (the largest part of this pull request).
 
 There will be a merge conflict in drivers/power/supply/ab8500_chargalg.c
 with your tree, the merge conflict should be easy (take all the
 changes).
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG6PA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylMFwCfSIyAU4oLEgj+/Rfmx4o45cAVIWMAnit3zbdU
 wUUCGqKcOnTJEcW6dMPh
 =1VVi
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core changes for 5.18-rc1.

  Not much here, primarily it was a bunch of cleanups and small updates:

   - kobj_type cleanups for default_groups

   - documentation updates

   - firmware loader minor changes

   - component common helper added and take advantage of it in many
     drivers (the largest part of this pull request).

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits)
  Documentation: update stable review cycle documentation
  drivers/base/dd.c : Remove the initial value of the global variable
  Documentation: update stable tree link
  Documentation: add link to stable release candidate tree
  devres: fix typos in comments
  Documentation: add note block surrounding security patch note
  samples/kobject: Use sysfs_emit instead of sprintf
  base: soc: Make soc_device_match() simpler and easier to read
  driver core: dd: fix return value of __setup handler
  driver core: Refactor sysfs and drv/bus remove hooks
  driver core: Refactor multiple copies of device cleanup
  scripts: get_abi.pl: Fix typo in help message
  kernfs: fix typos in comments
  kernfs: remove unneeded #if 0 guard
  ALSA: hda/realtek: Make use of the helper component_compare_dev_name
  video: omapfb: dss: Make use of the helper component_compare_dev
  power: supply: ab8500: Make use of the helper component_compare_dev
  ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of
  iommu/mediatek: Make use of the helper component_compare/release_of
  drm: of: Make use of the helper component_release_of
  ...
2022-03-28 12:41:28 -07:00
Linus Torvalds
52deda9551 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "Various misc subsystems, before getting into the post-linux-next
  material.

  41 patches.

  Subsystems affected by this patch series: procfs, misc, core-kernel,
  lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump,
  taskstats, panic, kcov, resource, and ubsan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
  Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang"
  kernel/resource: fix kfree() of bootmem memory again
  kcov: properly handle subsequent mmap calls
  kcov: split ioctl handling into locked and unlocked parts
  panic: move panic_print before kmsg dumpers
  panic: add option to dump all CPUs backtraces in panic_print
  docs: sysctl/kernel: add missing bit to panic_print
  taskstats: remove unneeded dead assignment
  kasan: no need to unset panic_on_warn in end_report()
  ubsan: no need to unset panic_on_warn in ubsan_epilogue()
  panic: unset panic_on_warn inside panic()
  docs: kdump: add scp example to write out the dump file
  docs: kdump: update description about sysfs file system support
  arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
  x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
  riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
  kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible
  cgroup: use irqsave in cgroup_rstat_flush_locked().
  fat: use pointer to simple type in put_user()
  minix: fix bug when opening a file with O_DIRECT
  ...
2022-03-24 14:14:07 -07:00
Sebastian Andrzej Siewior
b1e2c8df0f cgroup: use irqsave in cgroup_rstat_flush_locked().
All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock
either with spin_lock_irq() or spin_lock_irqsave().
cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which
is a raw_spin_lock.  This lock is also acquired in
cgroup_rstat_updated() in IRQ context and therefore requires _irqsave()
locking suffix in cgroup_rstat_flush_locked().

Since there is no difference between spin_lock_t and raw_spin_lock_t on
!RT lockdep does not complain here.  On RT lockdep complains because the
interrupts were not disabled here and a deadlock is possible.

Acquire the raw_spin_lock_t with disabled interrupts.

Link: https://lkml.kernel.org/r/20220301122143.1521823-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: cgroup: add a comment to cgroup_rstat_flush_locked().

Add a comment why spin_lock_irq() -> raw_spin_lock_irqsave() is needed.

Link: https://lkml.kernel.org/r/Yh+DOK73hfVV5ThX@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23 19:00:34 -07:00
Linus Torvalds
2fce7ea0e0 Merge branch 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "All trivial cleanups without meaningful behavior changes"

* 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: cleanup comments
  cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment
  cgroup: rstat: retrieve current bstat to delta directly
  cgroup: rstat: use same convention to assign cgroup_base_stat
2022-03-23 12:43:35 -07:00
Ingo Molnar
ccdbf33c23 Linux 5.17-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmIuUskeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCFkH/2n3mpGXuITp0ZXE
 TNrpbdZOof5SgLw+w7THswXuo6m5yRGNKQs9fvIvDD8Vf7/OdQQfPOmF1cIE5+nk
 wcz6aHKbdrok8Jql2qjJqWXZ5xbGj6qywg3zZrwOUsCKFP5p+AjBJcmZOsvQHjSp
 ASODy1moOlK+nO52TrMaJw74a8xQPmQiNa+T2P+FedEYjlcRH/c7hLJ7GEnL6+cC
 /R4bATZq3tiInbTBlkC0hR0iVNgRXwXNyv9PEXrYYYHnekh8G1mgSNf06iejLcsG
 aAYsW9NyPxu8zPhhHNx79K9o8BMtxGD4YQpsfdfIEnf9Q3euqAKe2evRWqHHlDms
 RuSCtsc=
 =M9Nc
 -----END PGP SIGNATURE-----

Merge tag 'v5.17-rc8' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15 10:28:12 +01:00
Tom Rix
f9da322e86 cgroup: cleanup comments
for spdx, add a space before //

replacements
judgement to judgment
transofrmed to transformed
partitition to partition
histrical to historical
migratecd to migrated

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-03-13 19:19:27 -10:00
Greg Kroah-Hartman
4a248f85b3 Merge 5.17-rc6 into driver-core-next
We need the driver core fix in here as well for future changes.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-28 07:45:41 +01:00