mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-01-01 10:45:49 +00:00
docs: scheduler: convert docs to ReST and rename to *.rst
In order to prepare to add them to the Kernel API book, convert the files to ReST format.

The conversion is actually:
- add blank lines and indentation in order to identify paragraphs;
- fix tables markups;
- add some lists markups;
- mark literal blocks;
- adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parent d223884089
commit d6a3b24762
@ -11,4 +11,4 @@ Description:
|
|||||||
example would be, if User A has shares = 1024 and user
|
example would be, if User A has shares = 1024 and user
|
||||||
B has shares = 2048, User B will get twice the CPU
|
B has shares = 2048, User B will get twice the CPU
|
||||||
bandwidth user A will. For more details refer
|
bandwidth user A will. For more details refer
|
||||||
Documentation/scheduler/sched-design-CFS.txt
|
Documentation/scheduler/sched-design-CFS.rst
|
||||||
|
@ -1,3 +1,4 @@
|
|||||||
|
================================================
|
||||||
Completions - "wait for completion" barrier APIs
|
Completions - "wait for completion" barrier APIs
|
||||||
================================================
|
================================================
|
||||||
|
|
||||||
@ -46,7 +47,7 @@ it has to wait for it.
|
|||||||
|
|
||||||
To use completions you need to #include <linux/completion.h> and
|
To use completions you need to #include <linux/completion.h> and
|
||||||
create a static or dynamic variable of type 'struct completion',
|
create a static or dynamic variable of type 'struct completion',
|
||||||
which has only two fields:
|
which has only two fields::
|
||||||
|
|
||||||
struct completion {
|
struct completion {
|
||||||
unsigned int done;
|
unsigned int done;
|
||||||
@ -57,7 +58,7 @@ This provides the ->wait waitqueue to place tasks on for waiting (if any), and
|
|||||||
the ->done completion flag for indicating whether it's completed or not.
|
the ->done completion flag for indicating whether it's completed or not.
|
||||||
|
|
||||||
Completions should be named to refer to the event that is being synchronized on.
|
Completions should be named to refer to the event that is being synchronized on.
|
||||||
A good example is:
|
A good example is::
|
||||||
|
|
||||||
wait_for_completion(&early_console_added);
|
wait_for_completion(&early_console_added);
|
||||||
|
|
||||||
@ -81,7 +82,7 @@ have taken place, even if these wait functions return prematurely due to a timeo
|
|||||||
or a signal triggering.
|
or a signal triggering.
|
||||||
|
|
||||||
Initializing of dynamically allocated completion objects is done via a call to
|
Initializing of dynamically allocated completion objects is done via a call to
|
||||||
init_completion():
|
init_completion()::
|
||||||
|
|
||||||
init_completion(&dynamic_object->done);
|
init_completion(&dynamic_object->done);
|
||||||
|
|
||||||
@ -100,7 +101,8 @@ but be aware of other races.
|
|||||||
|
|
||||||
For static declaration and initialization, macros are available.
|
For static declaration and initialization, macros are available.
|
||||||
|
|
||||||
For static (or global) declarations in file scope you can use DECLARE_COMPLETION():
|
For static (or global) declarations in file scope you can use
|
||||||
|
DECLARE_COMPLETION()::
|
||||||
|
|
||||||
static DECLARE_COMPLETION(setup_done);
|
static DECLARE_COMPLETION(setup_done);
|
||||||
DECLARE_COMPLETION(setup_done);
|
DECLARE_COMPLETION(setup_done);
|
||||||
@ -111,7 +113,7 @@ initialized to 'not done' and doesn't require an init_completion() call.
|
|||||||
When a completion is declared as a local variable within a function,
|
When a completion is declared as a local variable within a function,
|
||||||
then the initialization should always use DECLARE_COMPLETION_ONSTACK()
|
then the initialization should always use DECLARE_COMPLETION_ONSTACK()
|
||||||
explicitly, not just to make lockdep happy, but also to make it clear
|
explicitly, not just to make lockdep happy, but also to make it clear
|
||||||
that limited scope had been considered and is intentional:
|
that limited scope had been considered and is intentional::
|
||||||
|
|
||||||
DECLARE_COMPLETION_ONSTACK(setup_done)
|
DECLARE_COMPLETION_ONSTACK(setup_done)
|
||||||
|
|
||||||
@ -140,11 +142,11 @@ Waiting for completions:
|
|||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
For a thread to wait for some concurrent activity to finish, it
|
For a thread to wait for some concurrent activity to finish, it
|
||||||
calls wait_for_completion() on the initialized completion structure:
|
calls wait_for_completion() on the initialized completion structure::
|
||||||
|
|
||||||
void wait_for_completion(struct completion *done)
|
void wait_for_completion(struct completion *done)
|
||||||
|
|
||||||
A typical usage scenario is:
|
A typical usage scenario is::
|
||||||
|
|
||||||
CPU#1 CPU#2
|
CPU#1 CPU#2
|
||||||
|
|
||||||
@ -192,17 +194,17 @@ A common problem that occurs is to have unclean assignment of return types,
|
|||||||
so take care to assign return-values to variables of the proper type.
|
so take care to assign return-values to variables of the proper type.
|
||||||
|
|
||||||
Checking for the specific meaning of return values also has been found
|
Checking for the specific meaning of return values also has been found
|
||||||
to be quite inaccurate, e.g. constructs like:
|
to be quite inaccurate, e.g. constructs like::
|
||||||
|
|
||||||
if (!wait_for_completion_interruptible_timeout(...))
|
if (!wait_for_completion_interruptible_timeout(...))
|
||||||
|
|
||||||
... would execute the same code path for successful completion and for the
|
... would execute the same code path for successful completion and for the
|
||||||
interrupted case - which is probably not what you want.
|
interrupted case - which is probably not what you want::
|
||||||
|
|
||||||
int wait_for_completion_interruptible(struct completion *done)
|
int wait_for_completion_interruptible(struct completion *done)
|
||||||
|
|
||||||
This function marks the task TASK_INTERRUPTIBLE while it is waiting.
|
This function marks the task TASK_INTERRUPTIBLE while it is waiting.
|
||||||
If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise.
|
If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise::
|
||||||
|
|
||||||
unsigned long wait_for_completion_timeout(struct completion *done, unsigned long timeout)
|
unsigned long wait_for_completion_timeout(struct completion *done, unsigned long timeout)
|
||||||
|
|
||||||
@ -214,7 +216,7 @@ Timeouts are preferably calculated with msecs_to_jiffies() or usecs_to_jiffies()
|
|||||||
to make the code largely HZ-invariant.
|
to make the code largely HZ-invariant.
|
||||||
|
|
||||||
If the returned timeout value is deliberately ignored a comment should probably explain
|
If the returned timeout value is deliberately ignored a comment should probably explain
|
||||||
why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc()).
|
why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc())::
|
||||||
|
|
||||||
long wait_for_completion_interruptible_timeout(struct completion *done, unsigned long timeout)
|
long wait_for_completion_interruptible_timeout(struct completion *done, unsigned long timeout)
|
||||||
|
|
||||||
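A hedged illustration of the return-value pitfall described in this hunk. The interruptible, timed wait returns a `long` that is negative (-ERESTARTSYS) when interrupted, 0 on timeout, and the remaining jiffies when the completion occurred. The sketch below is plain userspace C rather than kernel code, and `classify_wait_result()` and the `wait_outcome` enum are hypothetical names introduced only for this example.

```c
/* Hypothetical names, for illustration only; not part of the kernel API. */
enum wait_outcome {
    WAIT_INTERRUPTED,   /* negative return, e.g. -ERESTARTSYS */
    WAIT_TIMED_OUT,     /* return value 0 */
    WAIT_COMPLETED      /* positive return: remaining jiffies */
};

/*
 * Classify the 'long' returned by an interruptible, timed wait.
 * A bare "if (!ret)" only detects the timeout; it cannot tell an
 * interrupted wait (ret < 0) from a successful one (ret > 0).
 */
enum wait_outcome classify_wait_result(long ret)
{
    if (ret < 0)
        return WAIT_INTERRUPTED;
    if (ret == 0)
        return WAIT_TIMED_OUT;
    return WAIT_COMPLETED;
}
```

Keeping the result in a `long` and checking all three ranges explicitly avoids treating the interrupted case as a success.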
@ -225,14 +227,14 @@ jiffies if completion occurred.
|
|||||||
|
|
||||||
Further variants include _killable which uses TASK_KILLABLE as the
|
Further variants include _killable which uses TASK_KILLABLE as the
|
||||||
designated tasks state and will return -ERESTARTSYS if it is interrupted,
|
designated tasks state and will return -ERESTARTSYS if it is interrupted,
|
||||||
or 0 if completion was achieved. There is a _timeout variant as well:
|
or 0 if completion was achieved. There is a _timeout variant as well::
|
||||||
|
|
||||||
long wait_for_completion_killable(struct completion *done)
|
long wait_for_completion_killable(struct completion *done)
|
||||||
long wait_for_completion_killable_timeout(struct completion *done, unsigned long timeout)
|
long wait_for_completion_killable_timeout(struct completion *done, unsigned long timeout)
|
||||||
|
|
||||||
The _io variants wait_for_completion_io() behave the same as the non-_io
|
The _io variants wait_for_completion_io() behave the same as the non-_io
|
||||||
variants, except for accounting waiting time as 'waiting on IO', which has
|
variants, except for accounting waiting time as 'waiting on IO', which has
|
||||||
an impact on how the task is accounted in scheduling/IO stats:
|
an impact on how the task is accounted in scheduling/IO stats::
|
||||||
|
|
||||||
void wait_for_completion_io(struct completion *done)
|
void wait_for_completion_io(struct completion *done)
|
||||||
unsigned long wait_for_completion_io_timeout(struct completion *done, unsigned long timeout)
|
unsigned long wait_for_completion_io_timeout(struct completion *done, unsigned long timeout)
|
||||||
@ -243,11 +245,11 @@ Signaling completions:
|
|||||||
|
|
||||||
A thread that wants to signal that the conditions for continuation have been
|
A thread that wants to signal that the conditions for continuation have been
|
||||||
achieved calls complete() to signal exactly one of the waiters that it can
|
achieved calls complete() to signal exactly one of the waiters that it can
|
||||||
continue:
|
continue::
|
||||||
|
|
||||||
void complete(struct completion *done)
|
void complete(struct completion *done)
|
||||||
|
|
||||||
... or calls complete_all() to signal all current and future waiters:
|
... or calls complete_all() to signal all current and future waiters::
|
||||||
|
|
||||||
void complete_all(struct completion *done)
|
void complete_all(struct completion *done)
|
||||||
|
|
||||||
@ -276,14 +278,14 @@ try_wait_for_completion()/completion_done():
|
|||||||
|
|
||||||
The try_wait_for_completion() function will not put the thread on the wait
|
The try_wait_for_completion() function will not put the thread on the wait
|
||||||
queue but rather returns false if it would need to enqueue (block) the thread,
|
queue but rather returns false if it would need to enqueue (block) the thread,
|
||||||
else it consumes one posted completion and returns true.
|
else it consumes one posted completion and returns true::
|
||||||
|
|
||||||
bool try_wait_for_completion(struct completion *done)
|
bool try_wait_for_completion(struct completion *done)
|
||||||
|
|
||||||
Finally, to check the state of a completion without changing it in any way,
|
Finally, to check the state of a completion without changing it in any way,
|
||||||
call completion_done(), which returns false if there are no posted
|
call completion_done(), which returns false if there are no posted
|
||||||
completions that were not yet consumed by waiters (implying that there are
|
completions that were not yet consumed by waiters (implying that there are
|
||||||
waiters) and true otherwise;
|
waiters) and true otherwise::
|
||||||
|
|
||||||
bool completion_done(struct completion *done)
|
bool completion_done(struct completion *done)
|
||||||
|
|
29
Documentation/scheduler/index.rst
Normal file
@ -0,0 +1,29 @@
|
|||||||
|
:orphan:
|
||||||
|
|
||||||
|
===============
|
||||||
|
Linux Scheduler
|
||||||
|
===============
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
|
||||||
|
completion
|
||||||
|
sched-arch
|
||||||
|
sched-bwc
|
||||||
|
sched-deadline
|
||||||
|
sched-design-CFS
|
||||||
|
sched-domains
|
||||||
|
sched-energy
|
||||||
|
sched-nice-design
|
||||||
|
sched-rt-group
|
||||||
|
sched-stats
|
||||||
|
|
||||||
|
text_files
|
||||||
|
|
||||||
|
.. only:: subproject and html
|
||||||
|
|
||||||
|
Indices
|
||||||
|
=======
|
||||||
|
|
||||||
|
* :ref:`genindex`
|
@ -1,4 +1,6 @@
|
|||||||
CPU Scheduler implementation hints for architecture specific code
|
=================================================================
|
||||||
|
CPU Scheduler implementation hints for architecture specific code
|
||||||
|
=================================================================
|
||||||
|
|
||||||
Nick Piggin, 2005
|
Nick Piggin, 2005
|
||||||
|
|
||||||
@ -35,9 +37,10 @@ Your cpu_idle routines need to obey the following rules:
|
|||||||
4. The only time interrupts need to be disabled when checking
|
4. The only time interrupts need to be disabled when checking
|
||||||
need_resched is if we are about to sleep the processor until
|
need_resched is if we are about to sleep the processor until
|
||||||
the next interrupt (this doesn't provide any protection of
|
the next interrupt (this doesn't provide any protection of
|
||||||
need_resched, it prevents losing an interrupt).
|
need_resched, it prevents losing an interrupt):
|
||||||
|
|
||||||
|
4a. Common problem with this type of sleep appears to be::
|
||||||
|
|
||||||
4a. Common problem with this type of sleep appears to be:
|
|
||||||
local_irq_disable();
|
local_irq_disable();
|
||||||
if (!need_resched()) {
|
if (!need_resched()) {
|
||||||
local_irq_enable();
|
local_irq_enable();
|
||||||
@ -51,7 +54,7 @@ Your cpu_idle routines need to obey the following rules:
|
|||||||
although it may be reasonable to do some background work or enter
|
although it may be reasonable to do some background work or enter
|
||||||
a low CPU priority.
|
a low CPU priority.
|
||||||
|
|
||||||
5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
|
- 5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
|
||||||
an interrupt sleep, it needs to be cleared then a memory
|
an interrupt sleep, it needs to be cleared then a memory
|
||||||
barrier issued (followed by a test of need_resched with
|
barrier issued (followed by a test of need_resched with
|
||||||
interrupts disabled, as explained in 3).
|
interrupts disabled, as explained in 3).
|
||||||
@ -71,4 +74,3 @@ sh64 - Is sleeping racy vs interrupts? (See #4a)
|
|||||||
|
|
||||||
sparc - IRQs on at this point(?), change local_irq_save to _disable.
|
sparc - IRQs on at this point(?), change local_irq_save to _disable.
|
||||||
- TODO: needs secondary CPUs to disable preempt (See #1)
|
- TODO: needs secondary CPUs to disable preempt (See #1)
|
||||||
|
|
@ -1,8 +1,9 @@
|
|||||||
|
=====================
|
||||||
CFS Bandwidth Control
|
CFS Bandwidth Control
|
||||||
=====================
|
=====================
|
||||||
|
|
||||||
[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
|
[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
|
||||||
The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
|
The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]
|
||||||
|
|
||||||
CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
|
CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
|
||||||
specification of the maximum CPU bandwidth available to a group or hierarchy.
|
specification of the maximum CPU bandwidth available to a group or hierarchy.
|
||||||
@ -27,7 +28,8 @@ cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
|
|||||||
cpu.cfs_period_us: the length of a period (in microseconds)
|
cpu.cfs_period_us: the length of a period (in microseconds)
|
||||||
cpu.stat: exports throttling statistics [explained further below]
|
cpu.stat: exports throttling statistics [explained further below]
|
||||||
|
|
||||||
The default values are:
|
The default values are::
|
||||||
|
|
||||||
cpu.cfs_period_us=100ms
|
cpu.cfs_period_us=100ms
|
||||||
cpu.cfs_quota=-1
|
cpu.cfs_quota=-1
|
||||||
|
|
||||||
@ -55,7 +57,8 @@ For efficiency run-time is transferred between the global pool and CPU local
|
|||||||
on large systems. The amount transferred each time such an update is required
|
on large systems. The amount transferred each time such an update is required
|
||||||
is described as the "slice".
|
is described as the "slice".
|
||||||
|
|
||||||
This is tunable via procfs:
|
This is tunable via procfs::
|
||||||
|
|
||||||
/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
|
/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
|
||||||
|
|
||||||
Larger slice values will reduce transfer overheads, while smaller values allow
|
Larger slice values will reduce transfer overheads, while smaller values allow
|
||||||
@ -66,6 +69,7 @@ Statistics
|
|||||||
A group's bandwidth statistics are exported via 3 fields in cpu.stat.
|
A group's bandwidth statistics are exported via 3 fields in cpu.stat.
|
||||||
|
|
||||||
cpu.stat:
|
cpu.stat:
|
||||||
|
|
||||||
- nr_periods: Number of enforcement intervals that have elapsed.
|
- nr_periods: Number of enforcement intervals that have elapsed.
|
||||||
- nr_throttled: Number of times the group has been throttled/limited.
|
- nr_throttled: Number of times the group has been throttled/limited.
|
||||||
- throttled_time: The total time duration (in nanoseconds) for which entities
|
- throttled_time: The total time duration (in nanoseconds) for which entities
|
||||||
@ -78,12 +82,15 @@ Hierarchical considerations
|
|||||||
The interface enforces that an individual entity's bandwidth is always
|
The interface enforces that an individual entity's bandwidth is always
|
||||||
attainable, that is: max(c_i) <= C. However, over-subscription in the
|
attainable, that is: max(c_i) <= C. However, over-subscription in the
|
||||||
aggregate case is explicitly allowed to enable work-conserving semantics
|
aggregate case is explicitly allowed to enable work-conserving semantics
|
||||||
within a hierarchy.
|
within a hierarchy:
|
||||||
|
|
||||||
e.g. \Sum (c_i) may exceed C
|
e.g. \Sum (c_i) may exceed C
|
||||||
|
|
||||||
[ Where C is the parent's bandwidth, and c_i its children ]
|
[ Where C is the parent's bandwidth, and c_i its children ]
|
||||||
|
|
||||||
|
|
||||||
There are two ways in which a group may become throttled:
|
There are two ways in which a group may become throttled:
|
||||||
|
|
||||||
a. it fully consumes its own quota within a period
|
a. it fully consumes its own quota within a period
|
||||||
b. a parent's quota is fully consumed within its period
|
b. a parent's quota is fully consumed within its period
|
||||||
|
|
||||||
@ -92,7 +99,7 @@ be allowed to until the parent's runtime is refreshed.
|
|||||||
|
|
||||||
Examples
|
Examples
|
||||||
--------
|
--------
|
||||||
1. Limit a group to 1 CPU worth of runtime.
|
1. Limit a group to 1 CPU worth of runtime::
|
||||||
|
|
||||||
If period is 250ms and quota is also 250ms, the group will get
|
If period is 250ms and quota is also 250ms, the group will get
|
||||||
1 CPU worth of runtime every 250ms.
|
1 CPU worth of runtime every 250ms.
|
||||||
@ -100,10 +107,10 @@ Examples
|
|||||||
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
|
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
|
||||||
# echo 250000 > cpu.cfs_period_us /* period = 250ms */
|
# echo 250000 > cpu.cfs_period_us /* period = 250ms */
|
||||||
|
|
||||||
2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
|
2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine
|
||||||
|
|
||||||
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
|
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
|
||||||
runtime every 500ms.
|
runtime every 500ms::
|
||||||
|
|
||||||
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
|
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
|
||||||
# echo 500000 > cpu.cfs_period_us /* period = 500ms */
|
# echo 500000 > cpu.cfs_period_us /* period = 500ms */
|
||||||
@ -112,11 +119,10 @@ Examples
|
|||||||
|
|
||||||
3. Limit a group to 20% of 1 CPU.
|
3. Limit a group to 20% of 1 CPU.
|
||||||
|
|
||||||
With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
|
With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::
|
||||||
|
|
||||||
# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
|
# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
|
||||||
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
|
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
|
||||||
|
|
||||||
By using a small period here we are ensuring a consistent latency
|
By using a small period here we are ensuring a consistent latency
|
||||||
response at the expense of burst capacity.
|
response at the expense of burst capacity.
|
||||||
|
|
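The three examples above all follow the same arithmetic: the CPU limit is the quota divided by the period. A minimal sketch of that calculation, assuming a quota of -1 means "no limit" as in the default values shown earlier; `cfs_cpu_limit()` is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>

/*
 * Return how many CPUs' worth of runtime a group may use per period,
 * given cpu.cfs_quota_us and cpu.cfs_period_us values in microseconds.
 * A negative quota (the default is -1) is treated as "unlimited" and
 * reported as -1.0. Hypothetical helper, for illustration only.
 */
double cfs_cpu_limit(long quota_us, long period_us)
{
    assert(period_us > 0);
    if (quota_us < 0)
        return -1.0;
    return (double)quota_us / (double)period_us;
}
```

With the values from the examples, 250000/250000 gives 1 CPU, 1000000/500000 gives 2 CPUs, and 10000/50000 gives 0.2, i.e. 20% of 1 CPU.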
@ -1,8 +1,8 @@
|
|||||||
Deadline Task Scheduling
|
========================
|
||||||
------------------------
|
Deadline Task Scheduling
|
||||||
|
========================
|
||||||
|
|
||||||
CONTENTS
|
.. CONTENTS
|
||||||
========
|
|
||||||
|
|
||||||
0. WARNING
|
0. WARNING
|
||||||
1. Overview
|
1. Overview
|
||||||
@ -44,7 +44,7 @@ CONTENTS
|
|||||||
|
|
||||||
|
|
||||||
2. Scheduling algorithm
|
2. Scheduling algorithm
|
||||||
==================
|
=======================
|
||||||
|
|
||||||
2.1 Main algorithm
|
2.1 Main algorithm
|
||||||
------------------
|
------------------
|
||||||
@ -80,7 +80,7 @@ CONTENTS
|
|||||||
a "remaining runtime". These two parameters are initially set to 0;
|
a "remaining runtime". These two parameters are initially set to 0;
|
||||||
|
|
||||||
- When a SCHED_DEADLINE task wakes up (becomes ready for execution),
|
- When a SCHED_DEADLINE task wakes up (becomes ready for execution),
|
||||||
the scheduler checks if
|
the scheduler checks if::
|
||||||
|
|
||||||
remaining runtime runtime
|
remaining runtime runtime
|
||||||
---------------------------------- > ---------
|
---------------------------------- > ---------
|
||||||
@ -97,7 +97,7 @@ CONTENTS
|
|||||||
left unchanged;
|
left unchanged;
|
||||||
|
|
||||||
- When a SCHED_DEADLINE task executes for an amount of time t, its
|
- When a SCHED_DEADLINE task executes for an amount of time t, its
|
||||||
remaining runtime is decreased as
|
remaining runtime is decreased as::
|
||||||
|
|
||||||
remaining runtime = remaining runtime - t
|
remaining runtime = remaining runtime - t
|
||||||
|
|
||||||
@ -112,7 +112,7 @@ CONTENTS
|
|||||||
|
|
||||||
- When the current time is equal to the replenishment time of a
|
- When the current time is equal to the replenishment time of a
|
||||||
throttled task, the scheduling deadline and the remaining runtime are
|
throttled task, the scheduling deadline and the remaining runtime are
|
||||||
updated as
|
updated as::
|
||||||
|
|
||||||
scheduling deadline = scheduling deadline + period
|
scheduling deadline = scheduling deadline + period
|
||||||
remaining runtime = remaining runtime + runtime
|
remaining runtime = remaining runtime + runtime
|
||||||
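The two update rules quoted in these hunks (runtime accounting while a task executes, and replenishment of a throttled task) can be sketched in plain C as follows. This is an illustrative userspace model under assumed names, not the scheduler's implementation: `struct dl_sketch`, `dl_account()` and `dl_replenish()` are hypothetical, and time is kept in arbitrary integer units.

```c
/* Hypothetical model of the per-task state named in the text. */
struct dl_sketch {
    long long deadline;          /* scheduling deadline */
    long long remaining_runtime; /* remaining runtime */
    long long runtime;           /* reserved runtime per period */
    long long period;            /* reservation period */
};

/* The task executed for t time units:
 * remaining runtime = remaining runtime - t */
void dl_account(struct dl_sketch *e, long long t)
{
    e->remaining_runtime -= t;
}

/* Replenishment time reached for a throttled task:
 * scheduling deadline = scheduling deadline + period
 * remaining runtime   = remaining runtime + runtime */
void dl_replenish(struct dl_sketch *e)
{
    e->deadline += e->period;
    e->remaining_runtime += e->runtime;
}
```

Note that replenishment adds to the remaining runtime rather than resetting it, so an overrun (a negative remaining runtime) is carried into the next period.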
@ -129,7 +129,7 @@ CONTENTS
|
|||||||
Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
|
Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
|
||||||
when flag SCHED_FLAG_RECLAIM is set.
|
when flag SCHED_FLAG_RECLAIM is set.
|
||||||
|
|
||||||
The following diagram illustrates the state names for tasks handled by GRUB:
|
The following diagram illustrates the state names for tasks handled by GRUB::
|
||||||
|
|
||||||
------------
|
------------
|
||||||
(d) | Active |
|
(d) | Active |
|
||||||
@ -168,7 +168,7 @@ CONTENTS
|
|||||||
breaking the real-time guarantees.
|
breaking the real-time guarantees.
|
||||||
|
|
||||||
The 0-lag time for a task entering the ActiveNonContending state is
|
The 0-lag time for a task entering the ActiveNonContending state is
|
||||||
computed as
|
computed as::
|
||||||
|
|
||||||
(runtime * dl_period)
|
(runtime * dl_period)
|
||||||
deadline - ---------------------
|
deadline - ---------------------
|
||||||
@ -222,7 +222,7 @@ CONTENTS
|
|||||||
|
|
||||||
|
|
||||||
Let's now see a trivial example of two deadline tasks with runtime equal
|
Let's now see a trivial example of two deadline tasks with runtime equal
|
||||||
to 4 and period equal to 8 (i.e., bandwidth equal to 0.5):
|
to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
|
||||||
|
|
||||||
A Task T1
|
A Task T1
|
||||||
|
|
|
|
||||||
@ -284,7 +284,7 @@ CONTENTS
|
|||||||
|
|
||||||
|
|
||||||
2.3 Energy-aware scheduling
|
2.3 Energy-aware scheduling
|
||||||
------------------------
|
---------------------------
|
||||||
|
|
||||||
When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
|
When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
|
||||||
GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
|
GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
|
||||||
@ -299,15 +299,20 @@ CONTENTS
|
|||||||
3. Scheduling Real-Time Tasks
|
3. Scheduling Real-Time Tasks
|
||||||
=============================
|
=============================
|
||||||
|
|
||||||
* BIG FAT WARNING ******************************************************
|
|
||||||
*
|
|
||||||
* This section contains a (not-thorough) summary on classical deadline
|
.. BIG FAT WARNING ******************************************************
|
||||||
* scheduling theory, and how it applies to SCHED_DEADLINE.
|
|
||||||
* The reader can "safely" skip to Section 4 if only interested in seeing
|
.. warning::
|
||||||
* how the scheduling policy can be used. Anyway, we strongly recommend
|
|
||||||
* to come back here and continue reading (once the urge for testing is
|
This section contains a (not-thorough) summary on classical deadline
|
||||||
* satisfied :P) to be sure of fully understanding all technical details.
|
scheduling theory, and how it applies to SCHED_DEADLINE.
|
||||||
************************************************************************
|
The reader can "safely" skip to Section 4 if only interested in seeing
|
||||||
|
how the scheduling policy can be used. Anyway, we strongly recommend
|
||||||
|
to come back here and continue reading (once the urge for testing is
|
||||||
|
satisfied :P) to be sure of fully understanding all technical details.
|
||||||
|
|
||||||
|
.. ************************************************************************
|
||||||
|
|
||||||
There are no limitations on what kind of task can exploit this new
|
There are no limitations on what kind of task can exploit this new
|
||||||
scheduling discipline, even if it must be said that it is particularly
|
scheduling discipline, even if it must be said that it is particularly
|
||||||
@ -329,6 +334,7 @@ CONTENTS
|
|||||||
sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
|
sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
|
||||||
d_j = r_j + D, where D is the task's relative deadline.
|
d_j = r_j + D, where D is the task's relative deadline.
|
||||||
Summing up, a real-time task can be described as
|
Summing up, a real-time task can be described as
|
||||||
|
|
||||||
Task = (WCET, D, P)
|
Task = (WCET, D, P)
|
||||||
|
|
||||||
The utilization of a real-time task is defined as the ratio between its
|
The utilization of a real-time task is defined as the ratio between its
|
||||||
@ -352,13 +358,15 @@ CONTENTS
|
|||||||
between the finishing time of a job and its absolute deadline).
|
between the finishing time of a job and its absolute deadline).
|
||||||
More precisely, it can be proven that using a global EDF scheduler the
|
More precisely, it can be proven that using a global EDF scheduler the
|
||||||
maximum tardiness of each task is smaller or equal than
|
maximum tardiness of each task is smaller or equal than
|
||||||
|
|
||||||
((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
|
((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
|
||||||
|
|
||||||
where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
|
where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
|
||||||
is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
|
is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
|
||||||
utilization[12].
|
utilization[12].
|
||||||
|
|
||||||
3.2 Schedulability Analysis for Uniprocessor Systems
|
3.2 Schedulability Analysis for Uniprocessor Systems
|
||||||
------------------------
|
----------------------------------------------------
|
||||||
|
|
||||||
If M=1 (uniprocessor system), or in case of partitioned scheduling (each
|
If M=1 (uniprocessor system), or in case of partitioned scheduling (each
|
||||||
real-time task is statically assigned to one and only one CPU), it is
|
real-time task is statically assigned to one and only one CPU), it is
|
||||||
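The tardiness bound quoted earlier in this hunk is easy to evaluate numerically. The sketch below only transcribes that formula; `gedf_tardiness_bound()` is a hypothetical helper name, and the caller is assumed to supply M, WCET_max, WCET_min and U_max as defined in the text.

```c
#include <assert.h>

/*
 * Evaluate
 *   ((M - 1) * WCET_max - WCET_min) / (M - (M - 2) * U_max) + WCET_max
 * for a global EDF scheduler on m CPUs. Hypothetical helper.
 */
double gedf_tardiness_bound(unsigned int m, double wcet_max,
                            double wcet_min, double u_max)
{
    double denom = (double)m - ((double)m - 2.0) * u_max;

    assert(m >= 1);
    assert(denom > 0.0);
    return (((double)m - 1.0) * wcet_max - wcet_min) / denom + wcet_max;
}
```

For instance, with M = 2, WCET_max = 4, WCET_min = 2 and U_max = 0.5, the expression evaluates to (4 - 2) / 2 + 4 = 5 time units.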
@ -370,7 +378,9 @@ CONTENTS
|
|||||||
a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
|
a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
|
||||||
of all the tasks running on a CPU if the sum of the densities of the tasks
|
of all the tasks running on a CPU if the sum of the densities of the tasks
|
||||||
running on such a CPU is smaller or equal than 1:
|
running on such a CPU is smaller or equal than 1:
|
||||||
|
|
||||||
sum(WCET_i / min{D_i, P_i}) <= 1
|
sum(WCET_i / min{D_i, P_i}) <= 1
|
||||||
|
|
||||||
It is important to notice that this condition is only sufficient, and not
|
It is important to notice that this condition is only sufficient, and not
|
||||||
necessary: there are task sets that are schedulable, but do not respect the
|
necessary: there are task sets that are schedulable, but do not respect the
|
||||||
condition. For example, consider the task set {Task_1,Task_2} composed by
|
condition. For example, consider the task set {Task_1,Task_2} composed by
|
||||||
@ -379,7 +389,9 @@ CONTENTS
|
|||||||
(Task_1 is scheduled as soon as it is released, and finishes just in time
|
(Task_1 is scheduled as soon as it is released, and finishes just in time
|
||||||
to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
|
to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
|
||||||
its response time cannot be larger than 50ms + 10ms = 60ms) even if
|
its response time cannot be larger than 50ms + 10ms = 60ms) even if
|
||||||
|
|
||||||
50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
|
50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
|
||||||
|
|
||||||
Of course it is possible to test the exact schedulability of tasks with
|
Of course it is possible to test the exact schedulability of tasks with
|
||||||
D_i != P_i (checking a condition that is both sufficient and necessary),
|
D_i != P_i (checking a condition that is both sufficient and necessary),
|
||||||
but this cannot be done by comparing the total utilization or density with
|
but this cannot be done by comparing the total utilization or density with
|
||||||
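A minimal sketch of the density-based test discussed in this hunk, sum(WCET_i / min{D_i, P_i}) <= 1. The struct and function names are hypothetical, and the test is only the sufficient condition from the text, so a task set that fails it may still be schedulable.

```c
#include <assert.h>
#include <stddef.h>

/* Task = (WCET, D, P), as in the text. Hypothetical type name. */
struct rt_task {
    double wcet;     /* worst-case execution time */
    double deadline; /* relative deadline D */
    double period;   /* period or minimum inter-arrival time P */
};

/* Sum of the densities WCET_i / min{D_i, P_i}. */
double edf_total_density(const struct rt_task *tasks, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        double d = tasks[i].deadline;
        double p = tasks[i].period;
        double m = d < p ? d : p;

        assert(m > 0.0);
        sum += tasks[i].wcet / m;
    }
    return sum;
}

/* Sufficient (not necessary) uniprocessor EDF test: density <= 1. */
int edf_density_test_passes(const struct rt_task *tasks, size_t n)
{
    return edf_total_density(tasks, n) <= 1.0;
}
```

Applied to the two-task example in the text (50 / min{50,100} + 10 / min{100,100}), the total density is about 1.1, so the test fails even though the text argues the task set is schedulable.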
@ -399,7 +411,7 @@ CONTENTS
|
|||||||
4 Linux uses an admission test based on the tasks' utilizations.
|
4 Linux uses an admission test based on the tasks' utilizations.
|
||||||
|
|
||||||
3.3 Schedulability Analysis for Multiprocessor Systems
|
3.3 Schedulability Analysis for Multiprocessor Systems
|
||||||
------------------------
|
------------------------------------------------------
|
||||||
|
|
||||||
On multiprocessor systems with global EDF scheduling (non partitioned
|
On multiprocessor systems with global EDF scheduling (non partitioned
|
||||||
systems), a sufficient test for schedulability can not be based on the
|
systems), a sufficient test for schedulability can not be based on the
|
||||||
@ -428,7 +440,9 @@ CONTENTS
|
|||||||
between total utilization (or density) and a fixed constant. If all tasks
|
between total utilization (or density) and a fixed constant. If all tasks
|
||||||
have D_i = P_i, a sufficient schedulability condition can be expressed in
|
have D_i = P_i, a sufficient schedulability condition can be expressed in
|
||||||
a simple way:
|
a simple way:
|
||||||
|
|
||||||
sum(WCET_i / P_i) <= M - (M - 1) · U_max
|
sum(WCET_i / P_i) <= M - (M - 1) · U_max
|
||||||
|
|
||||||
where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
|
where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
|
||||||
M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
|
M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
|
||||||
just confirms the Dhall's effect. A more complete survey of the literature
|
just confirms the Dhall's effect. A more complete survey of the literature
|
||||||
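A minimal sketch of the sufficient condition sum(WCET_i / P_i) <= M - (M - 1) · U_max quoted in this hunk, assuming all tasks have D_i = P_i. The function name and the array-based interface are assumptions made for illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if the task set passes the sufficient global-EDF test on
 * m CPUs, 0 otherwise. Because the test is only sufficient, 0 means
 * "not proven schedulable", not "unschedulable".
 */
int gedf_sufficient_test(const double *wcet, const double *period,
			 size_t n, unsigned int m)
{
	double sum = 0.0, u_max = 0.0;

	assert(m >= 1);
	for (size_t i = 0; i < n; i++) {
		double u;

		assert(period[i] > 0.0);
		u = wcet[i] / period[i];
		sum += u;
		if (u > u_max)
			u_max = u;
	}
	return sum <= (double)m - ((double)m - 1.0) * u_max;
}
```

With M = 2 and utilizations 0.5, 0.5 and 0.4 the bound is 1.5 and the test passes. With one task of utilization 1 the bound collapses to 1 regardless of M, which is the U_max = 1 case described in the text.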
@ -447,7 +461,7 @@ CONTENTS
|
|||||||
the tasks are limited.
|
the tasks are limited.
|
||||||
|
|
||||||
3.4 Relationship with SCHED_DEADLINE Parameters
|
3.4 Relationship with SCHED_DEADLINE Parameters
|
||||||
------------------------
|
-----------------------------------------------
|
||||||
|
|
||||||
Finally, it is important to understand the relationship between the
|
Finally, it is important to understand the relationship between the
|
||||||
SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
|
SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
|
||||||
@ -473,6 +487,7 @@ CONTENTS
|
|||||||
this task, as it is not possible to respect its temporal constraints.
|
this task, as it is not possible to respect its temporal constraints.
|
||||||
|
|
||||||
References:
|
References:
|
||||||
|
|
||||||
1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
|
1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
|
||||||
ming in a hard-real-time environment. Journal of the Association for
|
ming in a hard-real-time environment. Journal of the Association for
|
||||||
Computing Machinery, 20(1), 1973.
|
Computing Machinery, 20(1), 1973.
|
||||||
@ -550,7 +565,7 @@ CONTENTS
|
|||||||
The interface used to control the CPU bandwidth that can be allocated
|
The interface used to control the CPU bandwidth that can be allocated
|
||||||
to -deadline tasks is similar to the one already used for -rt
|
to -deadline tasks is similar to the one already used for -rt
|
||||||
tasks with real-time group scheduling (a.k.a. RT-throttling - see
|
tasks with real-time group scheduling (a.k.a. RT-throttling - see
|
||||||
Documentation/scheduler/sched-rt-group.txt), and is based on readable/
|
Documentation/scheduler/sched-rt-group.rst), and is based on readable/
|
||||||
writable control files located in procfs (for system wide settings).
|
writable control files located in procfs (for system wide settings).
|
||||||
Notice that per-group settings (controlled through cgroupfs) are still not
|
Notice that per-group settings (controlled through cgroupfs) are still not
|
||||||
defined for -deadline tasks, because more discussion is needed in order to
|
defined for -deadline tasks, because more discussion is needed in order to
|
||||||
@ -596,11 +611,13 @@ CONTENTS
|
|||||||
Specifying a periodic/sporadic task that executes for a given amount of
|
Specifying a periodic/sporadic task that executes for a given amount of
|
||||||
runtime at each instance, and that is scheduled according to the urgency of
|
runtime at each instance, and that is scheduled according to the urgency of
|
||||||
its own timing constraints needs, in general, a way of declaring:
|
its own timing constraints needs, in general, a way of declaring:
|
||||||
|
|
||||||
- a (maximum/typical) instance execution time,
|
- a (maximum/typical) instance execution time,
|
||||||
- a minimum interval between consecutive instances,
|
- a minimum interval between consecutive instances,
|
||||||
- a time constraint by which each instance must be completed.
|
- a time constraint by which each instance must be completed.
|
||||||
|
|
||||||
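The three quantities listed above correspond to the runtime, period and deadline parameters of a reservation. The sketch below checks the ordering runtime <= deadline <= period, which the sched(7) manual page states as a kernel requirement for SCHED_DEADLINE. `struct dl_params` and `dl_params_consistent()` are hypothetical names used only here; the real ABI structure is struct sched_attr, and special cases accepted by the kernel are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three timing parameters, in nanoseconds. */
struct dl_params {
	uint64_t runtime_ns;  /* (maximum/typical) instance execution time */
	uint64_t deadline_ns; /* time by which each instance must complete */
	uint64_t period_ns;   /* minimum interval between instances */
};

/* Returns 1 when 0 < runtime <= deadline <= period, 0 otherwise. */
int dl_params_consistent(const struct dl_params *p)
{
	assert(p != NULL);
	return p->runtime_ns > 0 &&
	       p->runtime_ns <= p->deadline_ns &&
	       p->deadline_ns <= p->period_ns;
}
```

A reservation of 10ms of runtime with a 30ms deadline and a 100ms period passes this check; swapping runtime and deadline makes it fail, since no instance could then finish in time.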
Therefore:
|
Therefore:
|
||||||
|
|
||||||
* a new struct sched_attr, containing all the necessary fields is
|
* a new struct sched_attr, containing all the necessary fields is
|
||||||
provided;
|
provided;
|
||||||
* the new scheduling related syscalls that manipulate it, i.e.,
|
* the new scheduling related syscalls that manipulate it, i.e.,
|
||||||
@ -658,7 +675,7 @@ CONTENTS
|
|||||||
------------------------------------
|
------------------------------------
|
||||||
|
|
||||||
An example of a simple configuration (pin a -deadline task to CPU0)
|
An example of a simple configuration (pin a -deadline task to CPU0)
|
||||||
follows (rt-app is used to create a -deadline task).
|
follows (rt-app is used to create a -deadline task)::
|
||||||
|
|
||||||
mkdir /dev/cpuset
|
mkdir /dev/cpuset
|
||||||
mount -t cgroup -o cpuset cpuset /dev/cpuset
|
mount -t cgroup -o cpuset cpuset /dev/cpuset
|
||||||
@ -671,8 +688,8 @@ CONTENTS
|
|||||||
echo 1 > cpu0/cpuset.cpu_exclusive
|
echo 1 > cpu0/cpuset.cpu_exclusive
|
||||||
echo 1 > cpu0/cpuset.mem_exclusive
|
echo 1 > cpu0/cpuset.mem_exclusive
|
||||||
echo $$ > cpu0/tasks
|
echo $$ > cpu0/tasks
|
||||||
rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
|
rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
|
||||||
task affinity)
|
# task affinity
|
||||||
|
|
||||||
6. Future plans
|
6. Future plans
|
||||||
===============
|
===============
|
||||||
@ -711,7 +728,7 @@ Appendix A. Test suite
|
|||||||
rt-app is available at: https://github.com/scheduler-tools/rt-app.
|
rt-app is available at: https://github.com/scheduler-tools/rt-app.
|
||||||
|
|
||||||
Thread parameters can be specified from the command line, with something like
|
Thread parameters can be specified from the command line, with something like
|
||||||
this:
|
this::
|
||||||
|
|
||||||
# rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5
|
# rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5
|
||||||
|
|
||||||
@ -721,27 +738,27 @@ Appendix A. Test suite
|
|||||||
of 5 seconds.
|
of 5 seconds.
|
||||||
|
|
||||||
More interestingly, configurations can be described with a json file that
|
More interestingly, configurations can be described with a json file that
|
||||||
can be passed as input to rt-app with something like this:
|
can be passed as input to rt-app with something like this::
|
||||||
|
|
||||||
# rt-app my_config.json
|
# rt-app my_config.json
|
||||||
|
|
||||||
The parameters that can be specified with the second method are a superset
|
The parameters that can be specified with the second method are a superset
|
||||||
of the command line options. Please refer to rt-app documentation for more
|
of the command line options. Please refer to rt-app documentation for more
|
||||||
details (<rt-app-sources>/doc/*.json).
|
details (`<rt-app-sources>/doc/*.json`).
|
||||||
|
|
||||||
The second testing application is a modification of schedtool, called
|
The second testing application is a modification of schedtool, called
|
||||||
schedtool-dl, which can be used to setup SCHED_DEADLINE parameters for a
|
schedtool-dl, which can be used to setup SCHED_DEADLINE parameters for a
|
||||||
certain pid/application. schedtool-dl is available at:
|
certain pid/application. schedtool-dl is available at:
|
||||||
https://github.com/scheduler-tools/schedtool-dl.git.
|
https://github.com/scheduler-tools/schedtool-dl.git.
|
||||||
|
|
||||||
The usage is straightforward:
|
The usage is straightforward::
|
||||||
|
|
||||||
# schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app
|
# schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app
|
||||||
|
|
||||||
With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
|
With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
|
||||||
of 10ms every 100ms (note that parameters are expressed in microseconds).
|
of 10ms every 100ms (note that parameters are expressed in microseconds).
|
||||||
You can also use schedtool to create a reservation for an already running
|
You can also use schedtool to create a reservation for an already running
|
||||||
application, given that you know its pid:
|
application, given that you know its pid::
|
||||||
|
|
||||||
# schedtool -E -t 10000000:100000000 my_app_pid
|
# schedtool -E -t 10000000:100000000 my_app_pid
|
||||||
|
|
||||||
@ -750,7 +767,7 @@ Appendix B. Minimal main()
|
|||||||
|
|
||||||
We provide in what follows a simple (ugly) self-contained code snippet
|
We provide in what follows a simple (ugly) self-contained code snippet
|
||||||
showing how SCHED_DEADLINE reservations can be created by a real-time
|
showing how SCHED_DEADLINE reservations can be created by a real-time
|
||||||
application developer.
|
application developer::
|
||||||
|
|
||||||
#define _GNU_SOURCE
|
#define _GNU_SOURCE
|
||||||
#include <unistd.h>
|
#include <unistd.h>
|
@ -1,9 +1,10 @@
|
|||||||
=============
|
=============
|
||||||
CFS Scheduler
|
CFS Scheduler
|
||||||
=============
|
=============
|
||||||
|
|
||||||
|
|
||||||
1. OVERVIEW
|
1. OVERVIEW
|
||||||
|
============
|
||||||
|
|
||||||
CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
|
CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
|
||||||
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
|
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
|
||||||
@ -27,6 +28,7 @@ is its actual runtime normalized to the total number of running tasks.
|
|||||||
|
|
||||||
|
|
||||||
2. FEW IMPLEMENTATION DETAILS
|
2. FEW IMPLEMENTATION DETAILS
|
||||||
|
==============================
|
||||||
|
|
||||||
In CFS the virtual runtime is expressed and tracked via the per-task
|
In CFS the virtual runtime is expressed and tracked via the per-task
|
||||||
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
|
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
|
||||||
@ -49,6 +51,7 @@ algorithm variants to recognize sleepers.
|
|||||||
|
|
||||||
|
|
||||||
3. THE RBTREE
|
3. THE RBTREE
|
||||||
|
==============
|
||||||
|
|
||||||
CFS's design is quite radical: it does not use the old data structures for the
|
CFS's design is quite radical: it does not use the old data structures for the
|
||||||
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
|
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
|
||||||
@ -84,6 +87,7 @@ picked and the current task is preempted.
|
|||||||
|
|
||||||
|
|
||||||
4. SOME FEATURES OF CFS
|
4. SOME FEATURES OF CFS
|
||||||
|
========================
|
||||||
|
|
||||||
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
|
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
|
||||||
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
|
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
|
||||||
@ -113,6 +117,7 @@ result.
|
|||||||
|
|
||||||
|
|
||||||
5. Scheduling policies
|
5. Scheduling policies
|
||||||
|
======================
|
||||||
|
|
||||||
CFS implements three scheduling policies:
|
CFS implements three scheduling policies:
|
||||||
|
|
||||||
@ -137,6 +142,7 @@ SCHED_IDLE.
|
|||||||
|
|
||||||
|
|
||||||
6. SCHEDULING CLASSES
|
6. SCHEDULING CLASSES
|
||||||
|
======================
|
||||||
|
|
||||||
The new CFS scheduler has been designed in such a way to introduce "Scheduling
|
The new CFS scheduler has been designed in such a way to introduce "Scheduling
|
||||||
Classes," an extensible hierarchy of scheduler modules. These modules
|
Classes," an extensible hierarchy of scheduler modules. These modules
|
||||||
@ -197,6 +203,7 @@ This is the (partial) list of the hooks:
|
|||||||
|
|
||||||
|
|
||||||
7. GROUP SCHEDULER EXTENSIONS TO CFS
|
7. GROUP SCHEDULER EXTENSIONS TO CFS
|
||||||
|
=====================================
|
||||||
|
|
||||||
Normally, the scheduler operates on individual tasks and strives to provide
|
Normally, the scheduler operates on individual tasks and strives to provide
|
||||||
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
|
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
|
||||||
@ -219,7 +226,7 @@ SCHED_BATCH) tasks.
|
|||||||
|
|
||||||
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
|
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
|
||||||
group created using the pseudo filesystem. See example steps below to create
|
group created using the pseudo filesystem. See example steps below to create
|
||||||
task groups and modify their CPU share using the "cgroups" pseudo filesystem.
|
task groups and modify their CPU share using the "cgroups" pseudo filesystem::
|
||||||
|
|
||||||
# mount -t tmpfs cgroup_root /sys/fs/cgroup
|
# mount -t tmpfs cgroup_root /sys/fs/cgroup
|
||||||
# mkdir /sys/fs/cgroup/cpu
|
# mkdir /sys/fs/cgroup/cpu
|
@ -1,3 +1,7 @@
|
|||||||
|
=================
|
||||||
|
Scheduler Domains
|
||||||
|
=================
|
||||||
|
|
||||||
Each CPU has a "base" scheduling domain (struct sched_domain). The domain
|
Each CPU has a "base" scheduling domain (struct sched_domain). The domain
|
||||||
hierarchy is built from these base domains via the ->parent pointer. ->parent
|
hierarchy is built from these base domains via the ->parent pointer. ->parent
|
||||||
MUST be NULL terminated, and domain structures should be per-CPU as they are
|
MUST be NULL terminated, and domain structures should be per-CPU as they are
|
||||||
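The NULL-terminated ->parent chain described in this hunk can be pictured with a toy structure. `struct toy_domain` is a simplified stand-in, not the kernel's struct sched_domain, and the level names in the usage note are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a scheduling domain; only the parent link matters here. */
struct toy_domain {
	struct toy_domain *parent; /* NULL at the top of the hierarchy */
	const char *name;
};

/* Counts the levels reachable from a base domain by following ->parent. */
unsigned int domain_depth(const struct toy_domain *base)
{
	unsigned int depth = 0;

	for (const struct toy_domain *d = base; d != NULL; d = d->parent)
		depth++;
	return depth;
}
```

A base domain whose parent is a second-level domain, which in turn points to a top-level domain with a NULL parent, yields a depth of 3. The walk terminates only because the last ->parent is NULL, which is why the text insists on NULL termination.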
@ -46,7 +50,9 @@ CPU's runqueue and the newly found busiest one and starts moving tasks from it
|
|||||||
to our runqueue. The exact number of tasks amounts to an imbalance previously
|
to our runqueue. The exact number of tasks amounts to an imbalance previously
|
||||||
computed while iterating over this sched domain's groups.
|
computed while iterating over this sched domain's groups.
|
||||||
|
|
||||||
*** Implementing sched domains ***
|
Implementing sched domains
|
||||||
|
==========================
|
||||||
|
|
||||||
The "base" domain will "span" the first level of the hierarchy. In the case
|
The "base" domain will "span" the first level of the hierarchy. In the case
|
||||||
of SMT, you'll span all siblings of the physical CPU, with each group being
|
of SMT, you'll span all siblings of the physical CPU, with each group being
|
||||||
a single virtual CPU.
|
a single virtual CPU.
|
@ -1,6 +1,6 @@
|
|||||||
=======================
|
=======================
|
||||||
Energy Aware Scheduling
|
Energy Aware Scheduling
|
||||||
=======================
|
=======================
|
||||||
|
|
||||||
1. Introduction
|
1. Introduction
|
||||||
---------------
|
---------------
|
||||||
@ -12,7 +12,7 @@ with a minimal impact on throughput. This document aims at providing an
|
|||||||
introduction on how EAS works, what are the main design decisions behind it, and
|
introduction on how EAS works, what are the main design decisions behind it, and
|
||||||
details what is needed to get it to run.
|
details what is needed to get it to run.
|
||||||
|
|
||||||
Before going any further, please note that at the time of writing:
|
Before going any further, please note that at the time of writing::
|
||||||
|
|
||||||
/!\ EAS does not support platforms with symmetric CPU topologies /!\
|
/!\ EAS does not support platforms with symmetric CPU topologies /!\
|
||||||
|
|
||||||
@ -33,13 +33,13 @@ To make it clear from the start:
|
|||||||
- power = energy/time = [joule/second] = [watt]
|
- power = energy/time = [joule/second] = [watt]
|
||||||
|
|
||||||
The goal of EAS is to minimize energy, while still getting the job done. That
|
The goal of EAS is to minimize energy, while still getting the job done. That
|
||||||
is, we want to maximize:
|
is, we want to maximize::
|
||||||
|
|
||||||
performance [inst/s]
|
performance [inst/s]
|
||||||
--------------------
|
--------------------
|
||||||
power [W]
|
power [W]
|
||||||
|
|
||||||
which is equivalent to minimizing:
|
which is equivalent to minimizing::
|
||||||
|
|
||||||
energy [J]
|
energy [J]
|
||||||
-----------
|
-----------
|
||||||
@ -97,7 +97,7 @@ domains can contain duplicate elements.
|
|||||||
|
|
||||||
Example 1.
|
Example 1.
|
||||||
Let us consider a platform with 12 CPUs, split in 3 performance domains
|
Let us consider a platform with 12 CPUs, split in 3 performance domains
|
||||||
(pd0, pd4 and pd8), organized as follows:
|
(pd0, pd4 and pd8), organized as follows::
|
||||||
|
|
||||||
CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
|
CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
|
||||||
PDs: |--pd0--|--pd4--|---pd8---|
|
PDs: |--pd0--|--pd4--|---pd8---|
|
||||||
@ -108,6 +108,7 @@ Example 1.
|
|||||||
containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
|
containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
|
||||||
above figure. Since pd4 intersects with both rd1 and rd2, it will be
|
above figure. Since pd4 intersects with both rd1 and rd2, it will be
|
||||||
present in the linked list '->pd' attached to each of them:
|
present in the linked list '->pd' attached to each of them:
|
||||||
|
|
||||||
* rd1->pd: pd0 -> pd4
|
* rd1->pd: pd0 -> pd4
|
||||||
* rd2->pd: pd4 -> pd8
|
* rd2->pd: pd4 -> pd8
|
||||||
|
|
||||||
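The membership of pd4 in both lists follows from a plain intersection of CPU sets. The sketch below uses bitmasks and assumes, from the figure, that pd0, pd4 and pd8 cover CPUs 0-3, 4-7 and 8-11, and that rd1 and rd2 cover CPUs 0-5 and 6-11. The type and function names are hypothetical and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i set means CPU i belongs to the domain (illustrative type only). */
typedef uint32_t cpumask_bits;

/*
 * Writes to out[] the indices of the performance domains that share at
 * least one CPU with the root domain rd, and returns how many were found.
 * out[] must have room for n_pd entries.
 */
size_t pds_in_root_domain(cpumask_bits rd, const cpumask_bits *pd,
			  size_t n_pd, size_t *out)
{
	size_t found = 0;

	assert(n_pd == 0 || (pd != NULL && out != NULL));
	for (size_t i = 0; i < n_pd; i++)
		if (pd[i] & rd)
			out[found++] = i;
	return found;
}
```

With masks 0x00F, 0x0F0 and 0xF00 for the three performance domains, a root domain of 0x03F selects the first two (pd0, pd4) and a root domain of 0xFC0 selects the last two (pd4, pd8), matching the two lists above.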
@ -159,7 +160,7 @@ Example 2.
|
|||||||
Each performance domain has three Operating Performance Points (OPPs).
|
Each performance domain has three Operating Performance Points (OPPs).
|
||||||
The CPU capacity and power cost associated with each OPP is listed in
|
The CPU capacity and power cost associated with each OPP is listed in
|
||||||
the Energy Model table. The util_avg of P is shown on the figures
|
the Energy Model table. The util_avg of P is shown on the figures
|
||||||
below as 'PP'.
|
below as 'PP'::
|
||||||
|
|
||||||
CPU util.
|
CPU util.
|
||||||
1024 - - - - - - - Energy Model
|
1024 - - - - - - - Energy Model
|
||||||
@ -188,8 +189,7 @@ Example 2.
|
|||||||
(which is coherent with the behaviour of the schedutil CPUFreq
|
(which is coherent with the behaviour of the schedutil CPUFreq
|
||||||
governor, see Section 6. for more details on this topic).
|
governor, see Section 6. for more details on this topic).
|
||||||
|
|
||||||
Case 1. P is migrated to CPU1
|
**Case 1. P is migrated to CPU1**::
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
1024 - - - - - - -
|
1024 - - - - - - -
|
||||||
|
|
||||||
@ -207,8 +207,7 @@ Example 2.
|
|||||||
CPU0 CPU1 CPU2 CPU3
|
CPU0 CPU1 CPU2 CPU3
|
||||||
|
|
||||||
|
|
||||||
Case 2. P is migrated to CPU3
|
**Case 2. P is migrated to CPU3**::
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
1024 - - - - - - -
|
1024 - - - - - - -
|
||||||
|
|
||||||
@ -226,8 +225,7 @@ Example 2.
|
|||||||
CPU0 CPU1 CPU2 CPU3
|
CPU0 CPU1 CPU2 CPU3
|
||||||
|
|
||||||
|
|
||||||
Case 3. P stays on prev_cpu / CPU 0
|
**Case 3. P stays on prev_cpu / CPU 0**::
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
1024 - - - - - - -
|
1024 - - - - - - -
|
||||||
|
|
||||||
@ -324,7 +322,9 @@ hardware properties and on other features of the kernel being enabled. This
|
|||||||
section lists these dependencies and provides hints as to how they can be met.
|
section lists these dependencies and provides hints as to how they can be met.
|
||||||
|
|
||||||
|
|
||||||
6.1 - Asymmetric CPU topology
|
6.1 - Asymmetric CPU topology
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
|
||||||
As mentioned in the introduction, EAS is only supported on platforms with
|
As mentioned in the introduction, EAS is only supported on platforms with
|
||||||
asymmetric CPU topologies for now. This requirement is checked at run-time by
|
asymmetric CPU topologies for now. This requirement is checked at run-time by
|
||||||
@ -347,7 +347,8 @@ significant savings on SMP platforms have been observed yet. This restriction
|
|||||||
could be amended in the future if proven otherwise.
|
could be amended in the future if proven otherwise.
|
||||||
|
|
||||||
|
|
||||||
6.2 - Energy Model presence
|
6.2 - Energy Model presence
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
EAS uses the EM of a platform to estimate the impact of scheduling decisions on
|
EAS uses the EM of a platform to estimate the impact of scheduling decisions on
|
||||||
energy. So, your platform must provide power cost tables to the EM framework in
|
energy. So, your platform must provide power cost tables to the EM framework in
|
||||||
@ -358,7 +359,8 @@ Please also note that the scheduling domains need to be re-built after the
|
|||||||
EM has been registered in order to start EAS.
|
EM has been registered in order to start EAS.
|
||||||
|
|
||||||
|
|
||||||
6.3 - Energy Model complexity
|
6.3 - Energy Model complexity
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
The task wake-up path is very latency-sensitive. When the EM of a platform is
|
The task wake-up path is very latency-sensitive. When the EM of a platform is
|
||||||
too complex (too many CPUs, too many performance domains, too many performance
|
too complex (too many CPUs, too many performance domains, too many performance
|
||||||
@ -388,7 +390,8 @@ two possible options:
|
|||||||
hence enabling it to cope with larger EMs in reasonable time.
|
hence enabling it to cope with larger EMs in reasonable time.
|
||||||
|
|
||||||
|
|
||||||
6.4 - Schedutil governor
|
6.4 - Schedutil governor
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
EAS tries to predict at which OPP will the CPUs be running in the close future
|
EAS tries to predict at which OPP will the CPUs be running in the close future
|
||||||
in order to estimate their energy consumption. To do so, it is assumed that OPPs
|
in order to estimate their energy consumption. To do so, it is assumed that OPPs
|
||||||
@ -405,7 +408,8 @@ frequency requests and energy predictions.
|
|||||||
Using EAS with any other governor than schedutil is not supported.
|
Using EAS with any other governor than schedutil is not supported.
|
||||||
|
|
||||||
|
|
||||||
6.5 Scale-invariant utilization signals
|
6.5 Scale-invariant utilization signals
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
In order to make accurate prediction across CPUs and for all performance
|
In order to make accurate prediction across CPUs and for all performance
|
||||||
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
|
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
|
||||||
@ -416,7 +420,8 @@ Using EAS on a platform that doesn't implement these two callbacks is not
|
|||||||
supported.
|
supported.
|
||||||
|
|
||||||
|
|
||||||
6.6 Multithreading (SMT)
|
6.6 Multithreading (SMT)
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
EAS in its current form is SMT unaware and is not able to leverage
|
EAS in its current form is SMT unaware and is not able to leverage
|
||||||
multithreaded hardware to save energy. EAS considers threads as independent
|
multithreaded hardware to save energy. EAS considers threads as independent
|
@ -1,3 +1,7 @@
|
|||||||
|
=====================
|
||||||
|
Scheduler Nice Design
|
||||||
|
=====================
|
||||||
|
|
||||||
This document explains the thinking about the revamped and streamlined
|
This document explains the thinking about the revamped and streamlined
|
||||||
nice-levels implementation in the new Linux scheduler.
|
nice-levels implementation in the new Linux scheduler.
|
||||||
|
|
||||||
@ -14,7 +18,7 @@ much stronger than they were before in 2.4 (and people were happy about
|
|||||||
that change), and we also intentionally calibrated the linear timeslice
|
that change), and we also intentionally calibrated the linear timeslice
|
||||||
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
|
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
|
||||||
understand it, the timeslice graph went like this (cheesy ASCII art
|
understand it, the timeslice graph went like this (cheesy ASCII art
|
||||||
alert!):
|
alert!)::
|
||||||
|
|
||||||
|
|
||||||
A
|
A
|
@ -1,18 +1,18 @@
|
|||||||
Real-Time group scheduling
|
==========================
|
||||||
--------------------------
|
Real-Time group scheduling
|
||||||
|
==========================
|
||||||
|
|
||||||
CONTENTS
|
.. CONTENTS
|
||||||
========
|
|
||||||
|
|
||||||
0. WARNING
|
0. WARNING
|
||||||
1. Overview
|
1. Overview
|
||||||
1.1 The problem
|
1.1 The problem
|
||||||
1.2 The solution
|
1.2 The solution
|
||||||
2. The interface
|
2. The interface
|
||||||
2.1 System-wide settings
|
2.1 System-wide settings
|
||||||
2.2 Default behaviour
|
2.2 Default behaviour
|
||||||
2.3 Basis for grouping tasks
|
2.3 Basis for grouping tasks
|
||||||
3. Future plans
|
3. Future plans
|
||||||
|
|
||||||
|
|
||||||
0. WARNING
|
0. WARNING
|
||||||
@ -159,9 +159,11 @@ Consider two sibling groups A and B; both have 50% bandwidth, but A's
|
|||||||
period is twice the length of B's.
|
period is twice the length of B's.
|
||||||
|
|
||||||
* group A: period=100000us, runtime=50000us
|
* group A: period=100000us, runtime=50000us
|
||||||
|
|
||||||
- this runs for 0.05s once every 0.1s
|
- this runs for 0.05s once every 0.1s
|
||||||
|
|
||||||
* group B: period= 50000us, runtime=25000us
|
* group B: period= 50000us, runtime=25000us
|
||||||
|
|
||||||
- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
|
- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
|
||||||
|
|
||||||
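The arithmetic behind the two groups can be checked with a short sketch. Both helper names are hypothetical; the values come from the group A and group B settings above.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth of a group as an integer percentage of one CPU. */
unsigned int rt_bandwidth_percent(uint64_t runtime_us, uint64_t period_us)
{
	assert(period_us > 0 && runtime_us <= period_us);
	return (unsigned int)(runtime_us * 100 / period_us);
}

/* Runtime granted over the complete periods that fit in window_us. */
uint64_t rt_runtime_in_window(uint64_t runtime_us, uint64_t period_us,
			      uint64_t window_us)
{
	assert(period_us > 0);
	return (window_us / period_us) * runtime_us;
}
```

Both groups come out at 50%, and both receive 50000us of runtime in any 100000us window: group A as one 50000us budget, group B as two 25000us budgets. The groups differ in how the budget is spread over time, not in the total.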
This means that currently a while (1) loop in A will run for the full period of
|
This means that currently a while (1) loop in A will run for the full period of
|
@ -1,3 +1,7 @@
|
|||||||
|
====================
|
||||||
|
Scheduler Statistics
|
||||||
|
====================
|
||||||
|
|
||||||
Version 15 of schedstats dropped counters for some sched_yield:
|
Version 15 of schedstats dropped counters for some sched_yield:
|
||||||
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
|
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
|
||||||
identical to version 14.
|
identical to version 14.
|
||||||
@ -35,19 +39,23 @@ CPU statistics
|
|||||||
cpu<N> 1 2 3 4 5 6 7 8 9
|
cpu<N> 1 2 3 4 5 6 7 8 9
|
||||||
|
|
||||||
First field is a sched_yield() statistic:
|
First field is a sched_yield() statistic:
|
||||||
|
|
||||||
1) # of times sched_yield() was called
|
1) # of times sched_yield() was called
|
||||||
|
|
||||||
Next three are schedule() statistics:
|
Next three are schedule() statistics:
|
||||||
|
|
||||||
2) This field is a legacy array expiration count field used in the O(1)
|
2) This field is a legacy array expiration count field used in the O(1)
|
||||||
scheduler. We kept it for ABI compatibility, but it is always set to zero.
|
scheduler. We kept it for ABI compatibility, but it is always set to zero.
|
||||||
3) # of times schedule() was called
|
3) # of times schedule() was called
|
||||||
4) # of times schedule() left the processor idle
|
4) # of times schedule() left the processor idle
|
||||||
|
|
||||||
Next two are try_to_wake_up() statistics:
|
Next two are try_to_wake_up() statistics:
|
||||||
|
|
||||||
5) # of times try_to_wake_up() was called
|
5) # of times try_to_wake_up() was called
|
||||||
6) # of times try_to_wake_up() was called to wake up the local cpu
|
6) # of times try_to_wake_up() was called to wake up the local cpu
|
||||||
|
|
||||||
Next three are statistics describing scheduling latency:
|
Next three are statistics describing scheduling latency:
|
||||||
|
|
||||||
7) sum of all time spent running by tasks on this processor (in jiffies)
|
7) sum of all time spent running by tasks on this processor (in jiffies)
|
||||||
8) sum of all time spent waiting to run by tasks on this processor (in
|
8) sum of all time spent waiting to run by tasks on this processor (in
|
||||||
jiffies)
|
jiffies)
|
||||||
@ -83,7 +91,6 @@ of idleness (idle, busy, and newly idle):
|
|||||||
not find a busier queue while the cpu was idle
|
not find a busier queue while the cpu was idle
|
||||||
8) # of times in this domain a busier queue was found while the
|
8) # of times in this domain a busier queue was found while the
|
||||||
cpu was idle but no busier group was found
|
cpu was idle but no busier group was found
|
||||||
|
|
||||||
9) # of times in this domain load_balance() was called when the
|
9) # of times in this domain load_balance() was called when the
|
||||||
cpu was busy
|
cpu was busy
|
||||||
10) # of times in this domain load_balance() checked but found the
|
10) # of times in this domain load_balance() checked but found the
|
||||||
@ -117,21 +124,25 @@ of idleness (idle, busy, and newly idle):
|
|||||||
was just becoming idle but no busier group was found
|
was just becoming idle but no busier group was found
|
||||||
|
|
||||||
Next three are active_load_balance() statistics:
|
Next three are active_load_balance() statistics:
|
||||||
|
|
||||||
25) # of times active_load_balance() was called
|
25) # of times active_load_balance() was called
|
||||||
26) # of times active_load_balance() tried to move a task and failed
|
26) # of times active_load_balance() tried to move a task and failed
|
||||||
27) # of times active_load_balance() successfully moved a task
|
27) # of times active_load_balance() successfully moved a task
|
||||||
|
|
||||||
Next three are sched_balance_exec() statistics:
|
Next three are sched_balance_exec() statistics:
|
||||||
|
|
||||||
28) sbe_cnt is not used
|
28) sbe_cnt is not used
|
||||||
29) sbe_balanced is not used
|
29) sbe_balanced is not used
|
||||||
30) sbe_pushed is not used
|
30) sbe_pushed is not used
|
||||||
|
|
||||||
Next three are sched_balance_fork() statistics:
|
Next three are sched_balance_fork() statistics:
|
||||||
|
|
||||||
31) sbf_cnt is not used
|
31) sbf_cnt is not used
|
||||||
32) sbf_balanced is not used
|
32) sbf_balanced is not used
|
||||||
33) sbf_pushed is not used
|
33) sbf_pushed is not used
|
||||||
|
|
||||||
Next three are try_to_wake_up() statistics:
|
Next three are try_to_wake_up() statistics:
|
||||||
|
|
||||||
34) # of times in this domain try_to_wake_up() awoke a task that
|
34) # of times in this domain try_to_wake_up() awoke a task that
|
||||||
last ran on a different cpu in this domain
|
last ran on a different cpu in this domain
|
||||||
35) # of times in this domain try_to_wake_up() moved a task to the
|
35) # of times in this domain try_to_wake_up() moved a task to the
|
||||||
@ -139,10 +150,11 @@ of idleness (idle, busy, and newly idle):
|
|||||||
36) # of times in this domain try_to_wake_up() started passive balancing
|
36) # of times in this domain try_to_wake_up() started passive balancing
|
||||||
|
|
||||||
/proc/<pid>/schedstat
|
/proc/<pid>/schedstat
|
||||||
----------------
|
---------------------
|
||||||
schedstats also adds a new /proc/<pid>/schedstat file to include some of
|
schedstats also adds a new /proc/<pid>/schedstat file to include some of
|
||||||
the same information on a per-process level. There are three fields in
|
the same information on a per-process level. There are three fields in
|
||||||
this file correlating for that process to:
|
this file correlating for that process to:
|
||||||
|
|
||||||
1) time spent on the cpu
|
1) time spent on the cpu
|
||||||
2) time spent waiting on a runqueue
|
2) time spent waiting on a runqueue
|
||||||
3) # of timeslices run on this cpu
|
3) # of timeslices run on this cpu
|
||||||
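A minimal sketch of reading the three fields listed above, assuming they appear as three whitespace-separated unsigned integers on one line. The struct and function names are hypothetical and are not taken from the latency.c program referenced in this file.

```c
#include <assert.h>
#include <stdio.h>

/* The three per-process fields, in the order the text lists them. */
struct pid_schedstat {
	unsigned long long run_time;   /* 1) time spent on the cpu */
	unsigned long long wait_time;  /* 2) time spent waiting on a runqueue */
	unsigned long long timeslices; /* 3) # of timeslices run on this cpu */
};

/* Parses one line of /proc/<pid>/schedstat; returns 0 on success, -1 otherwise. */
int parse_pid_schedstat(const char *line, struct pid_schedstat *out)
{
	assert(line != NULL && out != NULL);
	if (sscanf(line, "%llu %llu %llu", &out->run_time,
		   &out->wait_time, &out->timeslices) != 3)
		return -1;
	return 0;
}
```

A caller would typically read the line with fopen() and fgets() on /proc/self/schedstat (or another pid's file) and pass it to `parse_pid_schedstat()`. Sampling twice and subtracting gives the change in run and wait time over the interval.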
@ -151,4 +163,5 @@ A program could be easily written to make use of these extra fields to
|
|||||||
report on how well a particular process or set of processes is faring
|
report on how well a particular process or set of processes is faring
|
||||||
under the scheduler's policies. A simple version of such a program is
|
under the scheduler's policies. A simple version of such a program is
|
||||||
available at
|
available at
|
||||||
|
|
||||||
http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c
|
http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c
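
As a rough illustration of the kind of program the text describes, the following minimal sketch reads the three fields of /proc/<pid>/schedstat in the order listed above. The names SchedStat, parse_schedstat and read_schedstat are hypothetical and chosen only for this example; they are not taken from latency.c. The sketch makes no assumption about the units of the two time fields.

```python
from typing import NamedTuple


class SchedStat(NamedTuple):
    """The three per-process fields, in the order the text lists them."""
    cpu_time: int    # 1) time spent on the cpu
    wait_time: int   # 2) time spent waiting on a runqueue
    timeslices: int  # 3) number of timeslices run on this cpu


def parse_schedstat(text: str) -> SchedStat:
    """Parse the whitespace-separated contents of a schedstat file."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    return SchedStat(*(int(field) for field in fields))


def read_schedstat(pid: int) -> SchedStat:
    """Read /proc/<pid>/schedstat (requires a kernel that provides it)."""
    with open(f"/proc/{pid}/schedstat") as f:
        return parse_schedstat(f.read())
```

Calling read_schedstat(os.getpid()) twice and subtracting the results would give the change in each field over the interval, which is one way to watch how a process is faring.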
|
5
Documentation/scheduler/text_files.rst
Normal file
@ -0,0 +1,5 @@
|
|||||||
|
Scheduler pelt c program
|
||||||
|
------------------------
|
||||||
|
|
||||||
|
.. literalinclude:: sched-pelt.c
|
||||||
|
:language: c
|
@ -99,7 +99,7 @@ Local allocation will tend to keep subsequent access to the allocated memory
|
|||||||
as long as the task on whose behalf the kernel allocated some memory does not
|
as long as the task on whose behalf the kernel allocated some memory does not
|
||||||
later migrate away from that memory. The Linux scheduler is aware of the
|
later migrate away from that memory. The Linux scheduler is aware of the
|
||||||
NUMA topology of the platform--embodied in the "scheduling domains" data
|
NUMA topology of the platform--embodied in the "scheduling domains" data
|
||||||
structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler
|
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
|
||||||
attempts to minimize task migration to distant scheduling domains. However,
|
attempts to minimize task migration to distant scheduling domains. However,
|
||||||
the scheduler does not take a task's NUMA footprint into account directly.
|
the scheduler does not take a task's NUMA footprint into account directly.
|
||||||
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
|
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
|
||||||
|
@ -734,7 +734,7 @@ menuconfig CGROUPS
|
|||||||
use with process control subsystems such as Cpusets, CFS, memory
|
use with process control subsystems such as Cpusets, CFS, memory
|
||||||
controls or device isolation.
|
controls or device isolation.
|
||||||
See
|
See
|
||||||
- Documentation/scheduler/sched-design-CFS.txt (CFS)
|
- Documentation/scheduler/sched-design-CFS.rst (CFS)
|
||||||
- Documentation/cgroup-v1/ (features for grouping, isolation
|
- Documentation/cgroup-v1/ (features for grouping, isolation
|
||||||
and resource control)
|
and resource control)
|
||||||
|
|
||||||
@ -835,7 +835,7 @@ config CFS_BANDWIDTH
|
|||||||
tasks running within the fair group scheduler. Groups with no limit
|
tasks running within the fair group scheduler. Groups with no limit
|
||||||
set are considered to be unconstrained and will run with no
|
set are considered to be unconstrained and will run with no
|
||||||
restriction.
|
restriction.
|
||||||
See Documentation/scheduler/sched-bwc.txt for more information.
|
See Documentation/scheduler/sched-bwc.rst for more information.
|
||||||
|
|
||||||
config RT_GROUP_SCHED
|
config RT_GROUP_SCHED
|
||||||
bool "Group scheduling for SCHED_RR/FIFO"
|
bool "Group scheduling for SCHED_RR/FIFO"
|
||||||
@ -846,7 +846,7 @@ config RT_GROUP_SCHED
|
|||||||
to task groups. If enabled, it will also make it impossible to
|
to task groups. If enabled, it will also make it impossible to
|
||||||
schedule realtime tasks for non-root users until you allocate
|
schedule realtime tasks for non-root users until you allocate
|
||||||
realtime bandwidth for them.
|
realtime bandwidth for them.
|
||||||
See Documentation/scheduler/sched-rt-group.txt for more information.
|
See Documentation/scheduler/sched-rt-group.rst for more information.
|
||||||
|
|
||||||
endif #CGROUP_SCHED
|
endif #CGROUP_SCHED
|
||||||
|
|
||||||
|
@ -726,7 +726,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
|
|||||||
* refill the runtime and set the deadline a period in the future,
|
* refill the runtime and set the deadline a period in the future,
|
||||||
* because keeping the current (absolute) deadline of the task would
|
* because keeping the current (absolute) deadline of the task would
|
||||||
* result in breaking guarantees promised to other tasks (refer to
|
* result in breaking guarantees promised to other tasks (refer to
|
||||||
* Documentation/scheduler/sched-deadline.txt for more information).
|
* Documentation/scheduler/sched-deadline.rst for more information).
|
||||||
*
|
*
|
||||||
* This function returns true if:
|
* This function returns true if:
|
||||||
*
|
*
|
||||||
|