mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-01-07 13:53:24 +00:00
431844b65f
As discussed during the distro-centric session within the sched_ext Microconference at LPC 2024, introduce a sequence counter that is incremented every time a BPF scheduler is loaded. This feature can help distributions in diagnosing potential performance regressions by identifying systems where users are running (or have ran) custom BPF schedulers. Example: arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 0 arighi@virtme-ng~> sudo scx_simple local=1 global=0 ^CEXIT: unregistered from user space arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 1 In this way user-space tools (such as Ubuntu's apport and similar) are able to gather and include this information in bug reports. Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com> Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Phil Auld <pauld@redhat.com> Signed-off-by: Andrea Righi <andrea.righi@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
327 lines
12 KiB
ReStructuredText
327 lines
12 KiB
ReStructuredText
==========================
|
|
Extensible Scheduler Class
|
|
==========================
|
|
|
|
sched_ext is a scheduler class whose behavior can be defined by a set of BPF
|
|
programs - the BPF scheduler.
|
|
|
|
* sched_ext exports a full scheduling interface so that any scheduling
|
|
algorithm can be implemented on top.
|
|
|
|
* The BPF scheduler can group CPUs however it sees fit and schedule them
|
|
together, as tasks aren't tied to specific CPUs at the time of wakeup.
|
|
|
|
* The BPF scheduler can be turned on and off dynamically anytime.
|
|
|
|
* The system integrity is maintained no matter what the BPF scheduler does.
|
|
The default scheduling behavior is restored anytime an error is detected,
|
|
a runnable task stalls, or on invoking the SysRq key sequence
|
|
:kbd:`SysRq-S`.
|
|
|
|
* When the BPF scheduler triggers an error, debug information is dumped to
|
|
aid debugging. The debug dump is passed to and printed out by the
|
|
scheduler binary. The debug dump can also be accessed through the
|
|
`sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
|
|
triggers a debug dump. This doesn't terminate the BPF scheduler and can
|
|
only be read through the tracepoint.
|
|
|
|
Switching to and from sched_ext
|
|
===============================
|
|
|
|
``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
|
|
``tools/sched_ext`` contains the example schedulers. The following config
|
|
options should be enabled to use sched_ext:
|
|
|
|
.. code-block:: none
|
|
|
|
CONFIG_BPF=y
|
|
CONFIG_SCHED_CLASS_EXT=y
|
|
CONFIG_BPF_SYSCALL=y
|
|
CONFIG_BPF_JIT=y
|
|
CONFIG_DEBUG_INFO_BTF=y
|
|
CONFIG_BPF_JIT_ALWAYS_ON=y
|
|
CONFIG_BPF_JIT_DEFAULT_ON=y
|
|
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
|
|
CONFIG_PAHOLE_HAS_BTF_TAG=y
|
|
|
|
sched_ext is used only when the BPF scheduler is loaded and running.
|
|
|
|
If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
|
|
treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
|
|
loaded.
|
|
|
|
When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
|
|
in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
|
|
``SCHED_EXT`` tasks are scheduled by sched_ext.
|
|
|
|
However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
|
|
set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
|
|
by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
|
|
``SCHED_IDLE`` policies are scheduled by CFS.
|
|
|
|
Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
|
|
detection of any internal error including stalled runnable tasks aborts the
|
|
BPF scheduler and reverts all tasks back to CFS.
|
|
|
|
.. code-block:: none
|
|
|
|
# make -j16 -C tools/sched_ext
|
|
# tools/sched_ext/scx_simple
|
|
local=0 global=3
|
|
local=5 global=24
|
|
local=9 global=44
|
|
local=13 global=56
|
|
local=17 global=72
|
|
^CEXIT: BPF scheduler unregistered
|
|
|
|
The current status of the BPF scheduler can be determined as follows:
|
|
|
|
.. code-block:: none
|
|
|
|
# cat /sys/kernel/sched_ext/state
|
|
enabled
|
|
# cat /sys/kernel/sched_ext/root/ops
|
|
simple
|
|
|
|
You can check if any BPF scheduler has ever been loaded since boot by examining
|
|
this monotonically incrementing counter (a value of zero indicates that no BPF
|
|
scheduler has been loaded):
|
|
|
|
.. code-block:: none
|
|
|
|
# cat /sys/kernel/sched_ext/enable_seq
|
|
1
|
|
|
|
``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
|
|
detailed information:
|
|
|
|
.. code-block:: none
|
|
|
|
# tools/sched_ext/scx_show_state.py
|
|
ops : simple
|
|
enabled : 1
|
|
switching_all : 1
|
|
switched_all : 1
|
|
enable_state : enabled (2)
|
|
bypass_depth : 0
|
|
nr_rejected : 0
|
|
enable_seq : 1
|
|
|
|
If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
|
|
be determined as follows:
|
|
|
|
.. code-block:: none
|
|
|
|
# grep ext /proc/self/sched
|
|
ext.enabled : 1
|
|
|
|
The Basics
|
|
==========
|
|
|
|
Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
|
|
programs that implement ``struct sched_ext_ops``. The only mandatory field
|
|
is ``ops.name`` which must be a valid BPF object name. All operations are
|
|
optional. The following modified excerpt is from
|
|
``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
|
|
|
|
.. code-block:: c
|
|
|
|
/*
|
|
* Decide which CPU a task should be migrated to before being
|
|
* enqueued (either at wakeup, fork time, or exec time). If an
|
|
* idle core is found by the default ops.select_cpu() implementation,
|
|
* then dispatch the task directly to SCX_DSQ_LOCAL and skip the
|
|
* ops.enqueue() callback.
|
|
*
|
|
* Note that this implementation has exactly the same behavior as the
|
|
* default ops.select_cpu implementation. The behavior of the scheduler
|
|
* would be exactly same if the implementation just didn't define the
|
|
* simple_select_cpu() struct_ops prog.
|
|
*/
|
|
s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
|
|
s32 prev_cpu, u64 wake_flags)
|
|
{
|
|
s32 cpu;
|
|
/* Need to initialize or the BPF verifier will reject the program */
|
|
bool direct = false;
|
|
|
|
cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
|
|
|
|
if (direct)
|
|
scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
|
|
|
|
return cpu;
|
|
}
|
|
|
|
/*
|
|
* Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
|
|
* callback will only be invoked if we failed to find a core to dispatch
|
|
* to in ops.select_cpu() above.
|
|
*
|
|
* Note that this implementation has exactly the same behavior as the
|
|
* default ops.enqueue implementation, which just dispatches the task
|
|
* to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same
|
|
* if the implementation just didn't define the simple_enqueue struct_ops
|
|
* prog.
|
|
*/
|
|
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
|
|
{
|
|
scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
|
|
}
|
|
|
|
s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
|
|
{
|
|
/*
|
|
* By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
|
|
* SCHED_BATCH tasks should use sched_ext.
|
|
*/
|
|
return 0;
|
|
}
|
|
|
|
void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
|
|
{
|
|
exit_type = ei->type;
|
|
}
|
|
|
|
SEC(".struct_ops")
|
|
struct sched_ext_ops simple_ops = {
|
|
.select_cpu = (void *)simple_select_cpu,
|
|
.enqueue = (void *)simple_enqueue,
|
|
.init = (void *)simple_init,
|
|
.exit = (void *)simple_exit,
|
|
.name = "simple",
|
|
};
|
|
|
|
Dispatch Queues
|
|
---------------
|
|
|
|
To match the impedance between the scheduler core and the BPF scheduler,
|
|
sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
|
|
priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
|
|
and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
|
|
an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
|
|
``scx_bpf_destroy_dsq()``.
|
|
|
|
A CPU always executes a task from its local DSQ. A task is "dispatched" to a
|
|
DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
|
|
local DSQ.
|
|
|
|
When a CPU is looking for the next task to run, if the local DSQ is not
|
|
empty, the first task is picked. Otherwise, the CPU tries to consume the
|
|
global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
|
|
is invoked.
|
|
|
|
Scheduling Cycle
|
|
----------------
|
|
|
|
The following briefly shows how a waking task is scheduled and executed.
|
|
|
|
1. When a task is waking up, ``ops.select_cpu()`` is the first operation
|
|
invoked. This serves two purposes. First, CPU selection optimization
|
|
hint. Second, waking up the selected CPU if idle.
|
|
|
|
The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
|
|
binding. The actual decision is made at the last step of scheduling.
|
|
However, there is a small performance gain if the CPU
|
|
``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
|
|
|
|
A side-effect of selecting a CPU is waking it up from idle. While a BPF
|
|
scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
|
|
using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
|
|
|
|
A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
|
|
calling ``scx_bpf_dispatch()``. If the task is dispatched to
|
|
``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
|
|
local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
|
|
Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
|
|
``ops.enqueue()`` callback to be skipped.
|
|
|
|
Note that the scheduler core will ignore an invalid CPU selection, for
|
|
example, if it's outside the allowed cpumask of the task.
|
|
|
|
2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
|
|
task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
|
|
can make one of the following decisions:
|
|
|
|
* Immediately dispatch the task to either the global or local DSQ by
|
|
calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
|
|
``SCX_DSQ_LOCAL``, respectively.
|
|
|
|
* Immediately dispatch the task to a custom DSQ by calling
|
|
``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
|
|
|
|
* Queue the task on the BPF side.
|
|
|
|
3. When a CPU is ready to schedule, it first looks at its local DSQ. If
|
|
empty, it then looks at the global DSQ. If there still isn't a task to
|
|
run, ``ops.dispatch()`` is invoked which can use the following two
|
|
functions to populate the local DSQ.
|
|
|
|
* ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
|
|
be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
|
|
``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
|
|
currently can't be called with BPF locks held, this is being worked on
|
|
and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
|
|
rather than performing them immediately. There can be up to
|
|
``ops.dispatch_max_batch`` pending tasks.
|
|
|
|
* ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
|
|
to the dispatching DSQ. This function cannot be called with any BPF
|
|
locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
|
|
before trying to consume the specified DSQ.
|
|
|
|
4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
|
|
the CPU runs the first one. If empty, the following steps are taken:
|
|
|
|
* Try to consume the global DSQ. If successful, run the task.
|
|
|
|
* If ``ops.dispatch()`` has dispatched any tasks, retry #3.
|
|
|
|
* If the previous task is an SCX task and still runnable, keep executing
|
|
it (see ``SCX_OPS_ENQ_LAST``).
|
|
|
|
* Go idle.
|
|
|
|
Note that the BPF scheduler can always choose to dispatch tasks immediately
|
|
in ``ops.enqueue()`` as illustrated in the above simple example. If only the
|
|
built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
|
|
a task is never queued on the BPF scheduler and both the local and global
|
|
DSQs are consumed automatically.
|
|
|
|
``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
|
|
``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
|
|
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
|
|
dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
|
|
function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
|
|
more information.
|
|
|
|
Where to Look
|
|
=============
|
|
|
|
* ``include/linux/sched/ext.h`` defines the core data structures, ops table
|
|
and constants.
|
|
|
|
* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
|
|
The functions prefixed with ``scx_bpf_`` can be called from the BPF
|
|
scheduler.
|
|
|
|
* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
|
|
|
|
* ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
|
|
custom DSQ.
|
|
|
|
* ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
|
|
levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
|
|
|
|
ABI Instability
|
|
===============
|
|
|
|
The APIs provided by sched_ext to BPF schedulers programs have no stability
|
|
guarantees. This includes the ops table callbacks and constants defined in
|
|
``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
|
|
``kernel/sched/ext.c``.
|
|
|
|
While we will attempt to provide a relatively stable API surface when
|
|
possible, they are subject to change without warning between kernel
|
|
versions.
|