License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                            # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                        270
GPL-2.0+ WITH Linux-syscall-note                       169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)     21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)     17
LGPL-2.1+ WITH Linux-syscall-note                       15
GPL-1.0+ WITH Linux-syscall-note                        14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)     5
LGPL-2.0+ WITH Linux-syscall-note                        4
LGPL-2.1 WITH Linux-syscall-note                         3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)               3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)              1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
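As a rough illustration of the comment-type distinction the script had to make, the sketch below formats an SPDX line for a given file name. It is not the script Thomas and Greg wrote, which is not reproduced in this message; spdx_line() and has_suffix() are hypothetical helpers. The only convention assumed is the kernel's documented one: a // comment in .c files and a /* */ comment in headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return nonzero if `name` ends with `suffix`. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Hypothetical helper: format the SPDX line for `filename` into `buf`.
 * Returns 0 on success, -1 for an unrecognized file type or a buffer
 * that is too small.
 */
static int spdx_line(const char *filename, const char *id,
		     char *buf, size_t len)
{
	int n;

	assert(buf != NULL && len > 0);

	if (has_suffix(filename, ".c"))
		n = snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
	else if (has_suffix(filename, ".h"))
		n = snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
	else
		return -1;

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with "cgroup-internal.h" and "GPL-2.0", the helper yields the same first line this header carries.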
/* SPDX-License-Identifier: GPL-2.0 */
2016-12-27 14:49:06 -05:00
#ifndef __CGROUP_INTERNAL_H
#define __CGROUP_INTERNAL_H

#include <linux/cgroup.h>
#include <linux/kernfs.h>
#include <linux/workqueue.h>
#include <linux/list.h>
2017-03-08 10:00:40 +02:00
#include <linux/refcount.h>
2019-09-07 07:23:15 -04:00
#include <linux/fs_parser.h>
2016-12-27 14:49:06 -05:00

cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly because they can introduce lockups, as doing so introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserve the
buffer for, and once again to get the path itself). Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-09 17:48:54 -04:00
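The pattern described in this commit message can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the kernel code: TRACE_PATH(), build_path() and emit_event() are hypothetical stand-ins for the macro, cgroup_path() and the trace event; a plain bool replaces the static branch; and an atomic_flag spin loop replaces spin_lock_irqsave().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PATH_LEN 64

static atomic_flag path_lock = ATOMIC_FLAG_INIT; /* stand-in for the spinlock */
static char path_buf[PATH_LEN];                  /* shared scratch buffer */
static bool event_enabled;                       /* stand-in for the static branch */
static int path_calls;                           /* counts expensive lookups */
static char last_event[PATH_LEN];                /* what the "handler" recorded */

/* Stand-in for cgroup_path(): the expensive, lock-taking lookup. */
static void build_path(int id, char *buf, size_t len)
{
	path_calls++;
	snprintf(buf, len, "/group%d", id);
}

/* Stand-in for the trace event handler: takes no locks, only copies. */
static void emit_event(const char *path)
{
	assert(path != NULL);
	snprintf(last_event, sizeof(last_event), "%s", path);
}

/* Build the path once, outside the handler, and only when enabled. */
#define TRACE_PATH(id)							\
	do {								\
		if (event_enabled) {					\
			while (atomic_flag_test_and_set(&path_lock))	\
				;					\
			build_path((id), path_buf, sizeof(path_buf));	\
			emit_event(path_buf);				\
			atomic_flag_clear(&path_lock);			\
		}							\
	} while (0)
```

With the event disabled, TRACE_PATH() does no path lookup at all; with it enabled, the lookup runs exactly once per event and the handler only sees the finished string.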
#define TRACE_CGROUP_PATH_LEN 1024
extern spinlock_t trace_cgroup_path_lock;
extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
2018-11-08 10:08:46 -05:00
extern void __init enable_debug_cgroup(void);
2018-07-09 17:48:54 -04:00

/*
 * cgroup_path() takes a spin lock. It is good practice not to take
 * spin locks within trace point handlers, as they are mostly hidden
 * from normal view. As cgroup_path() can take the kernfs_rename_lock
 * spin lock, it is best to not call that function from the trace event
 * handler.
 *
 * Note: trace_cgroup_##type##_enabled() is a static branch that will only
 * be set when the trace event is enabled.
 */
#define TRACE_CGROUP_PATH(type, cgrp, ...)	\
	do {	\
		if (trace_cgroup_##type##_enabled()) {	\
2019-04-19 10:03:07 -07:00
			unsigned long flags;	\
			spin_lock_irqsave(&trace_cgroup_path_lock,	\
					  flags);	\
2018-07-09 17:48:54 -04:00
			cgroup_path(cgrp, trace_cgroup_path,	\
				    TRACE_CGROUP_PATH_LEN);	\
			trace_cgroup_##type(cgrp, trace_cgroup_path,	\
					    ##__VA_ARGS__);	\
2019-04-19 10:03:07 -07:00
			spin_unlock_irqrestore(&trace_cgroup_path_lock,	\
					       flags);	\
2018-07-09 17:48:54 -04:00
		}	\
	} while (0)

2019-01-05 00:38:03 -05:00
/*
 * The cgroup filesystem superblock creation/mount context.
 */
struct cgroup_fs_context {
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
	struct kernfs_fs_context kfc;
2019-01-17 02:25:51 -05:00
	struct cgroup_root	*root;
2019-01-17 10:14:26 -05:00
	struct cgroup_namespace	*ns;
2019-01-16 23:42:38 -05:00
	unsigned int	flags;			/* CGRP_ROOT_* flags */

	/* cgroup1 bits */
	bool		cpuset_clone_children;
	bool		none;			/* User explicitly requested empty subsystem */
	bool		all_ss;			/* Seen 'all' option */
	u16		subsys_mask;		/* Selected subsystems */
	char		*name;			/* Hierarchy name */
	char		*release_agent;		/* Path for release notifications */
2019-01-05 00:38:03 -05:00
};

static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc)
{
2018-11-01 23:07:26 +00:00
	struct kernfs_fs_context *kfc = fc->fs_private;

	return container_of(kfc, struct cgroup_fs_context, kfc);
2019-01-05 00:38:03 -05:00
}
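The container_of() step in cgroup_fc2context() recovers the wrapping structure from a pointer to a member embedded in it. The sketch below shows the same idea in userspace. It is an illustration only: the macro is a simplified version without the kernel's type checking, and struct inner_ctx, struct outer_ctx and inner_to_outer() are hypothetical names. The embedded member is deliberately not the first field, so the pointer adjustment is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of(), without the kernel's type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct inner_ctx {		/* plays the role of kernfs_fs_context */
	int magic;
};

struct outer_ctx {		/* plays the role of cgroup_fs_context */
	unsigned int flags;
	struct inner_ctx inner;	/* embedded by value, not a pointer */
};

/* Recover the wrapping structure from a pointer to its embedded member. */
static struct outer_ctx *inner_to_outer(struct inner_ctx *in)
{
	assert(in != NULL);
	return container_of(in, struct outer_ctx, inner);
}
```

This only works because the inner structure is embedded by value; a pointer member would give no fixed offset to subtract.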

2022-01-06 11:02:29 -10:00
struct cgroup_pidlist;

struct cgroup_file_ctx {
2022-01-06 11:02:29 -10:00
	struct cgroup_namespace	*ns;

2022-01-06 11:02:29 -10:00
	struct {
		void			*trigger;
	} psi;

	struct {
		bool			started;
		struct css_task_iter	iter;
	} procs;

	struct {
		struct cgroup_pidlist	*pidlist;
	} procs1;
mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark. Restore parity
with those mechanisms, but with a less racy API.
For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
the high watermark.
- writing "5" to the clear_refs pseudo-file in a process's proc
directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior, however, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
filedescriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files resets the high watermark to the current usage for subsequent
reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path. Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.
This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item. Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place. To facilitate this, we track
the peak memory usage. However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing. Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.
Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
Signed-off-by: David Finkel <davidf@vimeo.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-29 10:37:42 -04:00

	struct cgroup_of_peak peak;
2022-01-06 11:02:29 -10:00
};

2016-12-27 14:49:06 -05:00
/*
 * A cgroup can be associated with multiple css_sets as different tasks may
 * belong to different cgroups on different hierarchies. In the other
 * direction, a css_set is naturally associated with multiple cgroups.
 * This M:N relationship is represented by the following link structure
 * which exists for each association and allows traversing the associations
 * from both sides.
 */
struct cgrp_cset_link {
	/* the cgroup and css_set this link associates */
	struct cgroup		*cgrp;
	struct css_set		*cset;

	/* list of cgrp_cset_links anchored at cgrp->cset_links */
	struct list_head	cset_link;

	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
	struct list_head	cgrp_link;
};

2017-01-15 19:03:41 -05:00
/* used to track tasks and csets during migration */
struct cgroup_taskset {
	/* the src and dst cset list running through cset->mg_node */
	struct list_head	src_csets;
	struct list_head	dst_csets;

cgroup: don't call migration methods if there are no tasks to migrate
Subsystem migration methods shouldn't be called for empty migrations.
cgroup_migrate_execute() implements this guarantee by bailing early if
there are no source css_sets. This used to be correct before
a79a908fd2b0 ("cgroup: introduce cgroup namespaces"), but no longer
since the commit because css_sets can stay pinned without tasks in
them.
This caused cgroup_migrate_execute() to call into cpuset migration
methods with an empty cgroup_taskset. cpuset migration methods
correctly assume that cgroup_taskset_first() never returns NULL;
however, due to the bug, it can, leading to the following oops.
Unable to handle kernel paging request for data at address 0x00000960
Faulting instruction address: 0xc0000000001d6868
Oops: Kernel access of bad area, sig: 11 [#1]
...
CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W
4.12.0-rc4-next-20170609 #2
Workqueue: events cpuset_hotplug_workfn
task: c00000000ca60580 task.stack: c00000000c728000
NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
Call Trace:
[c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
[c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
[c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
[c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
[c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
[c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
[c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
[c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
---[ end trace dcaaf98fb36d9e64 ]---
This patch fixes the bug by adding an explicit nr_tasks counter to
cgroup_taskset and skipping calling the migration methods if the
counter is zero. While at it, remove the now spurious check on no
source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
2017-07-08 07:17:02 -04:00
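The fix described above amounts to an explicit emptiness check before any subsystem method runs. A minimal userspace sketch follows; struct taskset, subsys_can_attach() and migrate_execute() are hypothetical stand-ins, not the kernel functions, and the assert models the assumption that migration methods may rely on a first task being present.

```c
#include <assert.h>

struct taskset {
	int nr_tasks;	/* number of tasks queued for migration */
};

static int can_attach_calls;	/* how often the subsystem method ran */

/* Stand-in for a subsystem's can_attach() method. */
static int subsys_can_attach(struct taskset *tset)
{
	assert(tset->nr_tasks > 0);	/* methods may rely on a first task */
	can_attach_calls++;
	return 0;
}

/* Skip the subsystem methods entirely for an empty migration. */
static int migrate_execute(struct taskset *tset)
{
	if (!tset->nr_tasks)
		return 0;
	return subsys_can_attach(tset);
}
```

Counting tasks directly avoids inferring emptiness from the source css_set list, which the commit message explains can be non-empty even when no tasks are being moved.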
	/* the number of tasks in the set */
	int			nr_tasks;

2017-01-15 19:03:41 -05:00
	/* the subsys currently being processed */
	int			ssid;

	/*
	 * Fields for cgroup_taskset_*() iteration.
	 *
	 * Before migration is committed, the target migration tasks are on
	 * ->mg_tasks of the csets on ->src_csets. After, on ->mg_tasks of
	 * the csets on ->dst_csets. ->csets points to either ->src_csets
	 * or ->dst_csets depending on whether migration is committed.
	 *
	 * ->cur_cset and ->cur_task point to the current task position
	 * during iteration.
	 */
	struct list_head	*csets;
	struct css_set		*cur_cset;
	struct task_struct	*cur_task;
};

/* migration context also tracks preloading */
struct cgroup_mgctx {
	/*
	 * Preloaded source and destination csets. Used to guarantee
	 * atomic success or failure on actual migration.
	 */
	struct list_head	preloaded_src_csets;
	struct list_head	preloaded_dst_csets;

	/* tasks and csets to migrate */
	struct cgroup_taskset	tset;
2017-01-15 19:03:41 -05:00

	/* subsystems affected by migration */
	u16			ss_mask;
2017-01-15 19:03:41 -05:00
};

#define CGROUP_TASKSET_INIT(tset)	\
{	\
	.src_csets		= LIST_HEAD_INIT(tset.src_csets),	\
	.dst_csets		= LIST_HEAD_INIT(tset.dst_csets),	\
	.csets			= &tset.src_csets,	\
}

#define CGROUP_MGCTX_INIT(name)	\
{	\
	LIST_HEAD_INIT(name.preloaded_src_csets),	\
	LIST_HEAD_INIT(name.preloaded_dst_csets),	\
	CGROUP_TASKSET_INIT(name.tset),	\
}

#define DEFINE_CGROUP_MGCTX(name)	\
	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)

2016-12-27 14:49:06 -05:00
extern struct cgroup_subsys *cgroup_subsys[];
extern struct list_head cgroup_roots;

/* iterate across the hierarchies */
#define for_each_root(root)	\
2023-10-29 06:14:29 +00:00
	list_for_each_entry_rcu((root), &cgroup_roots, root_list,	\
				lockdep_is_held(&cgroup_mutex))
2016-12-27 14:49:06 -05:00

/**
 * for_each_subsys - iterate all enabled cgroup subsystems
 * @ss: the iteration cursor
 * @ssid: the index of @ss, CGROUP_SUBSYS_COUNT after reaching the end
 */
#define for_each_subsys(ss, ssid)	\
	for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT &&	\
	     (((ss) = cgroup_subsys[ssid]) || true); (ssid)++)
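The "|| true" in for_each_subsys() lets the cursor be assigned inside the loop condition without the assigned value deciding whether the loop continues; only the index bound ends the walk. The userspace sketch below demonstrates this with a table that contains a NULL slot. All names here (for_each_slot, subsys_table, count_slots) are hypothetical, and whether a slot of the real cgroup_subsys[] can ever be NULL is not something this header states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBSYS_COUNT 3

struct subsys { const char *name; };

static struct subsys cpu = { "cpu" }, memory = { "memory" };

/* A NULL slot in the middle must not end the walk early. */
static struct subsys *subsys_table[SUBSYS_COUNT] = { &cpu, NULL, &memory };

/* The "|| true" keeps the loop condition true even when (ss) is NULL. */
#define for_each_slot(ss, i)					\
	for ((i) = 0; (i) < SUBSYS_COUNT &&			\
	     (((ss) = subsys_table[i]) || true); (i)++)

/* Return the number of iterations; store the non-NULL count in *non_null. */
static int count_slots(int *non_null)
{
	struct subsys *ss;
	int i, visited = 0;

	*non_null = 0;
	for_each_slot(ss, i) {
		visited++;
		if (ss)
			(*non_null)++;
	}
	assert(i == SUBSYS_COUNT);	/* the index ends at the count */
	return visited;
}
```

Because the bound check comes first in the && expression, the table is never indexed past its end, and the index equals the count once the loop finishes, matching the @ssid description in the kerneldoc comment.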

static inline bool cgroup_is_dead(const struct cgroup *cgrp)
{
	return !(cgrp->self.flags & CSS_ONLINE);
}

static inline bool notify_on_release(const struct cgroup *cgrp)
{
	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
}

2016-12-27 14:49:09 -05:00
void put_css_set_locked(struct css_set *cset);

static inline void put_css_set(struct css_set *cset)
{
	unsigned long flags;

	/*
	 * Ensure that the refcount doesn't hit zero while any readers
	 * can see it. Similar to atomic_dec_and_lock(), but for an
	 * rwlock
	 */
2017-03-08 10:00:40 +02:00
	if (refcount_dec_not_one(&cset->refcount))
2016-12-27 14:49:09 -05:00
		return;

	spin_lock_irqsave(&css_set_lock, flags);
	put_css_set_locked(cset);
	spin_unlock_irqrestore(&css_set_lock, flags);
}
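The fast-path/slow-path split in put_css_set() can be sketched with C11 atomics. This is a userspace approximation under stated assumptions, not the kernel's refcount_t implementation: dec_not_one() and put_ref() are hypothetical names, a single global counter replaces the per-object refcount, and an atomic_flag spin loop stands in for css_set_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int refcount = 2;
static atomic_flag lock = ATOMIC_FLAG_INIT;	/* stand-in for css_set_lock */
static int slow_path_calls;			/* final puts done under the lock */

/* Decrement unless the count is 1; return true if it was decremented. */
static bool dec_not_one(atomic_int *r)
{
	int old = atomic_load(r);

	while (old != 1) {
		/* On failure, `old` is reloaded with the current value. */
		if (atomic_compare_exchange_weak(r, &old, old - 1))
			return true;
	}
	return false;
}

/* Fast path without the lock; the last reference is dropped under it. */
static void put_ref(void)
{
	assert(atomic_load(&refcount) > 0);

	if (dec_not_one(&refcount))
		return;

	while (atomic_flag_test_and_set(&lock))
		;
	slow_path_calls++;
	atomic_fetch_sub(&refcount, 1);	/* may reach zero, but only under the lock */
	atomic_flag_clear(&lock);
}
```

The point of the split is that the count can only reach zero while the lock is held, so anyone who looks the object up under that lock never observes a zero count.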

/*
 * refcounted get/put for css_set objects
 */
static inline void get_css_set(struct css_set *cset)
{
2017-03-08 10:00:40 +02:00
	refcount_inc(&cset->refcount);
2016-12-27 14:49:09 -05:00
}

2016-12-27 14:49:06 -05:00
bool cgroup_ssid_enabled(int ssid);
bool cgroup_on_dfl(const struct cgroup *cgrp);

struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root);
struct cgroup *task_cgroup_from_root(struct task_struct *task,
				     struct cgroup_root *root);
struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline);
void cgroup_kn_unlock(struct kernfs_node *kn);
int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
			  struct cgroup_namespace *ns);

2022-07-23 04:28:28 -10:00
void cgroup_favor_dynmods(struct cgroup_root *root, bool favor);
2016-12-27 14:49:08 -05:00
void cgroup_free_root(struct cgroup_root *root);
2019-01-17 02:25:51 -05:00
void init_cgroup_root(struct cgroup_fs_context *ctx);
2019-01-12 00:20:54 -05:00
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
2016-12-27 14:49:06 -05:00
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
2019-01-17 10:14:26 -05:00
int cgroup_do_get_tree(struct fs_context *fc);
2016-12-27 14:49:06 -05:00

cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 11:14:51 -04:00
|
|
|
int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
|
2017-01-15 19:03:41 -05:00
|
|
|
void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
|
|
|
|
void cgroup_migrate_add_src(struct css_set *src_cset, struct cgroup *dst_cgrp,
|
|
|
|
struct cgroup_mgctx *mgctx);
|
|
|
|
int cgroup_migrate_prepare_dst(struct cgroup_mgctx *mgctx);
|
2016-12-27 14:49:06 -05:00
|
|
|
int cgroup_migrate(struct task_struct *leader, bool threadgroup,
|
2017-01-15 19:03:41 -05:00
|
|
|
struct cgroup_mgctx *mgctx);
|
2016-12-27 14:49:06 -05:00
|
|
|
|
|
|
|
int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
|
|
|
|
bool threadgroup);
|
2022-08-26 11:48:29 +09:00
|
|
|
void cgroup_attach_lock(bool lock_threadgroup);
|
|
|
|
void cgroup_attach_unlock(bool lock_threadgroup);
|
cgroup: Optimize single thread migration
Users who migrate threads between cgroups report a performance drop
after d59cfc09c32a ("sched, cgroup: replace
signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
more pronounced on machines with more CPUs.
The migration is affected by forking noise happening in the background:
after the mentioned commit, a migrating thread must wait for all
(forking) processes on the system, not only those of its threadgroup.
There are several places that need to synchronize with migration:
a) do_exit,
b) de_thread,
c) copy_process,
d) cgroup_update_dfl_csses,
e) parallel migration (cgroup_{proc,thread}s_write).
In the case of a self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in the case of single
thread migration, and the executing thread cannot exec(2) or exit(2)
while it is writing into cgroup.threads. In the case of do_exit because
of signal delivery, we either exit before the migration or finish the
migration (of a not yet PF_EXITING thread) and die afterwards.
This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.
This change improves the performance of migration-dependent workloads
to a level similar to that of the per-signal_struct state.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-04 12:57:40 +02:00
|
|
|
struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
|
|
|
|
bool *locked)
|
2017-05-15 09:34:00 -04:00
|
|
|
__acquires(&cgroup_threadgroup_rwsem);
|
2019-10-04 12:57:40 +02:00
|
|
|
void cgroup_procs_write_finish(struct task_struct *task, bool locked)
|
2017-05-15 09:34:00 -04:00
|
|
|
__releases(&cgroup_threadgroup_rwsem);
|
2016-12-27 14:49:06 -05:00
|
|
|
|
|
|
|
void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
|
|
|
|
|
2016-12-27 14:49:08 -05:00
|
|
|
int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode);
|
|
|
|
int cgroup_rmdir(struct kernfs_node *kn);
|
|
|
|
int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
|
|
|
|
struct kernfs_root *kf_root);
|
|
|
|
|
2019-04-19 10:03:02 -07:00
|
|
|
int __cgroup_task_count(const struct cgroup *cgrp);
|
2017-06-13 17:18:02 -04:00
|
|
|
int cgroup_task_count(const struct cgroup *cgrp);
|
|
|
|
|
2017-09-25 08:12:05 -07:00
|
|
|
/*
|
2018-04-26 14:29:04 -07:00
|
|
|
* rstat.c
|
2017-09-25 08:12:05 -07:00
|
|
|
*/
|
2018-04-26 14:29:04 -07:00
|
|
|
int cgroup_rstat_init(struct cgroup *cgrp);
|
|
|
|
void cgroup_rstat_exit(struct cgroup *cgrp);
|
|
|
|
void cgroup_rstat_boot(void);
|
2018-04-26 14:29:05 -07:00
|
|
|
void cgroup_base_stat_cputime_show(struct seq_file *seq);
|
2017-09-25 08:12:05 -07:00
|
|
|
|
2016-12-27 14:49:09 -05:00
|
|
|
/*
|
|
|
|
* namespace.c
|
|
|
|
*/
|
|
|
|
extern const struct proc_ns_operations cgroupns_operations;
|
|
|
|
|
2016-12-27 14:49:06 -05:00
|
|
|
/*
|
|
|
|
* cgroup-v1.c
|
|
|
|
*/
|
2016-12-27 14:49:08 -05:00
|
|
|
extern struct cftype cgroup1_base_files[];
|
2016-12-27 14:49:08 -05:00
|
|
|
extern struct kernfs_syscall_ops cgroup1_kf_syscall_ops;
|
2019-09-07 07:23:15 -04:00
|
|
|
extern const struct fs_parameter_spec cgroup1_fs_parameters[];
|
2016-12-27 14:49:06 -05:00
|
|
|
|
2018-05-15 15:57:23 +02:00
|
|
|
int proc_cgroupstats_show(struct seq_file *m, void *v);
|
2016-12-27 14:49:08 -05:00
|
|
|
bool cgroup1_ssid_disabled(int ssid);
|
|
|
|
void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
|
|
|
|
void cgroup1_release_agent(struct work_struct *work);
|
|
|
|
void cgroup1_check_for_release(struct cgroup *cgrp);
|
2019-01-17 00:15:11 -05:00
|
|
|
int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param);
|
2019-01-16 21:23:02 -05:00
|
|
|
int cgroup1_get_tree(struct fs_context *fc);
|
2019-01-05 00:38:03 -05:00
|
|
|
int cgroup1_reconfigure(struct fs_context *ctx);
|
2016-12-27 14:49:06 -05:00
|
|
|
|
|
|
|
#endif /* __CGROUP_INTERNAL_H */
|