linux-stable/tools/sched_ext
Tejun Heo 5cbb302880 sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
In sched_ext API, a repeatedly reported pain point is the overuse of the
verb "dispatch" and confusion around "consume":

- ops.dispatch()
- scx_bpf_dispatch[_vtime]()
- scx_bpf_consume()
- scx_bpf_dispatch[_vtime]_from_dsq*()

This overloading of the term is historical. Originally, there were only
built-in DSQs and moving a task into a DSQ always dispatched it for
execution. Using the verb "dispatch" for the kfuncs to move tasks into these
DSQs made sense.

Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be
able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was
from a non-local DSQ to a local DSQ and this operation was named "consume".
This was already confusing as a task could be dispatched to a user DSQ from
ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch().
Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse
as "dispatch" in this context meant moving a task to an arbitrary DSQ from a
user DSQ.

Clean up the API with the following renames:

1. scx_bpf_dispatch[_vtime]()		-> scx_bpf_dsq_insert[_vtime]()
2. scx_bpf_consume()			-> scx_bpf_dsq_move_to_local()
3. scx_bpf_dispatch[_vtime]_from_dsq*()	-> scx_bpf_dsq_move[_vtime]*()

This patch performs the third set of renames. Compatibility is maintained
by:

- The previous kfunc names are still provided by the kernel so that old
  binaries can run. The kernel generates a warning when the old names are
  used.

- compat.bpf.h provides wrappers for the new names which automatically fall
  back to the old names when running on older kernels. They also trigger a
  build error if the old names are used in new builds.

- scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT
  macros as they were introduced during v6.12 cycle. Wrap new API in
  __COMPAT macros too and trigger build errors on both __COMPAT prefixed and
  naked usages of the old names.

The compat features will be dropped after v6.15.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Johannes Bechberger <me@mostlynerdless.de>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11 07:06:16 -10:00
include sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() 2024-11-11 07:06:16 -10:00
.gitignore sched_ext: Add scx_simple and scx_example_qmap example schedulers 2024-06-18 10:09:17 -10:00
Makefile sched_ext: Add a cgroup scheduler which uses flattened hierarchy 2024-09-04 10:24:59 -10:00
README.md sched_ext: Add a cgroup scheduler which uses flattened hierarchy 2024-09-04 10:24:59 -10:00
scx_central.bpf.c sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local() 2024-11-11 07:06:16 -10:00
scx_central.c sched_ext: Implement sched_ext_ops.cpu_online/offline() 2024-06-18 10:09:20 -10:00
scx_flatcg.bpf.c sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local() 2024-11-11 07:06:16 -10:00
scx_flatcg.c sched_ext: Add a cgroup scheduler which uses flattened hierarchy 2024-09-04 10:24:59 -10:00
scx_flatcg.h sched_ext: Add a cgroup scheduler which uses flattened hierarchy 2024-09-04 10:24:59 -10:00
scx_qmap.bpf.c sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() 2024-11-11 07:06:16 -10:00
scx_qmap.c scx_qmap: Implement highpri boosting 2024-09-09 13:42:47 -10:00
scx_show_state.py sched_ext: Enable the ops breather and eject BPF scheduler on softlockup 2024-11-08 10:42:22 -10:00
scx_simple.bpf.c sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local() 2024-11-11 07:06:16 -10:00
scx_simple.c sched_ext: Add vtime-ordered priority queue to dispatch_q's 2024-06-18 10:09:21 -10:00

SCHED_EXT EXAMPLE SCHEDULERS

Introduction

This directory contains a number of example sched_ext schedulers. These schedulers are meant to provide examples of different types of schedulers that can be built using sched_ext, and illustrate how various features of sched_ext can be used.

Some of the examples are performant, production-ready schedulers. That is, for the correct workload and with the correct tuning, they may be deployed in a production environment with acceptable or possibly even improved performance. Others are just examples that, in practice, would not provide acceptable performance (though they could be improved to get there).

This README describes these example schedulers, including the types of workloads or scenarios they're designed to accommodate and whether or not they're production ready. For more details on any of these schedulers, please see the header comment in their .bpf.c file.

Compiling the examples

There are a few toolchain dependencies for compiling the example schedulers.

Toolchain dependencies

  1. clang >= 16.0.0

The schedulers are BPF programs, and therefore must be compiled with clang. gcc is actively working on adding a BPF backend compiler as well, but it is still missing some features such as BTF type tags which are necessary for using kptrs.

  2. pahole >= 1.25

You may need pahole in order to generate BTF from DWARF.

  3. rust >= 1.70.0

Rust schedulers use features present in the rust toolchain >= 1.70.0. You should be able to use the stable build from rustup, but if that doesn't work, try using the rustup nightly build.

There are other requirements as well, such as make, but these are the main / non-trivial ones.
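
As a rough way to check the minimums listed above, the following sketch compares the version reported by each tool against a required version. The helper names (version_ge, first_version, check_tool) are hypothetical and not part of this tree, and the comparison assumes a sort that supports -V (GNU coreutils does).

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of tools/sched_ext.

# version_ge HAVE NEED: succeed if version HAVE is >= version NEED.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
	[ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# first_version: print the first dotted version number found on stdin,
# e.g. "16.0.6" from "clang version 16.0.6" or "1.25" from "v1.25".
first_version() {
	grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# check_tool NAME NEED: report whether NAME is installed and new enough.
check_tool() {
	if ! command -v "$1" >/dev/null 2>&1; then
		echo "$1: not found (need >= $2)"
		return 1
	fi
	have=$("$1" --version 2>/dev/null | first_version)
	if version_ge "$have" "$2"; then
		echo "$1: $have (ok, need >= $2)"
	else
		echo "$1: ${have:-unknown} (too old, need >= $2)"
		return 1
	fi
}

# Example usage:
#   check_tool clang 16.0.0
#   check_tool pahole 1.25
#   check_tool rustc 1.70.0
```

The functions only print a status line per tool, so they can be pasted into a shell before starting a build.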

Compiling the kernel

In order to run a sched_ext scheduler, you'll have to run a kernel compiled with the patches in this repository, and with a minimum set of necessary Kconfig options:

CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y

It's also recommended that you include the following Kconfig options:

CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y

There is a Kconfig file in this directory whose contents you can append to your local .config file, as long as there are no conflicts with any existing options in the file.
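
To confirm that a given .config carries the required options, something like the following sketch may help. The function name missing_kconfig is hypothetical; it simply greps a config file for each required option from the first list above and prints the ones that are not set to =y.

```shell
#!/bin/sh
# Sketch only: missing_kconfig is an illustrative helper, not part of this tree.

# Required options, copied from the list above.
REQUIRED_OPTS="CONFIG_BPF CONFIG_SCHED_CLASS_EXT CONFIG_BPF_SYSCALL CONFIG_BPF_JIT CONFIG_DEBUG_INFO_BTF"

# missing_kconfig FILE: print each required option that is not set to =y in
# FILE. Succeeds only if nothing is missing. The body runs in a subshell so
# that its variables do not leak into the caller.
missing_kconfig() (
	status=0
	for opt in $REQUIRED_OPTS; do
		if ! grep -qx "${opt}=y" "$1"; then
			echo "$opt"
			status=1
		fi
	done
	exit "$status"
)

# Example usage (paths are assumptions about your setup):
#   missing_kconfig .config
#   missing_kconfig "/boot/config-$(uname -r)"
```

An empty output with a zero exit status means all five required options are enabled in that file.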

Getting a vmlinux.h file

You may notice that most of the example schedulers include a "vmlinux.h" file. This is a large, auto-generated header file that contains all of the types defined in some vmlinux binary that was compiled with BTF (i.e. with the BTF-related Kconfig options specified above).

The header file is created using bpftool, by passing it a vmlinux binary compiled with BTF as follows:

$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h

bpftool analyzes all of the BTF encodings in the binary, and produces a header file that can be included by BPF programs to access those types. For example, using vmlinux.h allows a scheduler to access fields defined directly in vmlinux as follows:

#include "vmlinux.h"
// vmlinux.h is also implicitly included by scx_common.bpf.h.
#include "scx_common.bpf.h"

/*
 * vmlinux.h provides definitions for struct task_struct and
 * struct scx_enable_args.
 */
void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
		    struct scx_enable_args *args)
{
	bpf_printk("Task %s enabled in example scheduler", p->comm);
}

// vmlinux.h provides the definition for struct sched_ext_ops.
SEC(".struct_ops.link")
struct sched_ext_ops example_ops = {
	.enable	= (void *)example_enable,
	.name	= "example",
};

The scheduler build system will generate this vmlinux.h file as part of the scheduler build pipeline. It looks for a vmlinux file in the following order and uses the first one it finds:

  1. If the O= environment variable is defined, at $O/vmlinux
  2. If the KBUILD_OUTPUT= environment variable is defined, at $KBUILD_OUTPUT/vmlinux
  3. At ../../vmlinux (i.e. at the root of the kernel tree where you're compiling the schedulers)
  4. /sys/kernel/btf/vmlinux
  5. /boot/vmlinux-$(uname -r)

In other words, if you have compiled a kernel in your local repo, its vmlinux file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of the kernel you're currently running on. This means that if you're running on a kernel with sched_ext support, you may not need to compile a local kernel at all.
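
The lookup order above can be sketched as a small shell function. This is an illustration of the documented order, not the build system's actual code, and find_vmlinux is a hypothetical name. It assumes it is run from tools/sched_ext so that ../../vmlinux points at the root of the kernel tree.

```shell
#!/bin/sh
# Sketch only: mirrors the documented search order; not taken from the Makefile.

# find_vmlinux: print the first candidate that exists and succeed, or fail
# if none of the candidates is present. O and KBUILD_OUTPUT are only
# considered when they are set and non-empty.
find_vmlinux() {
	for candidate in \
		${O:+"$O/vmlinux"} \
		${KBUILD_OUTPUT:+"$KBUILD_OUTPUT/vmlinux"} \
		../../vmlinux \
		/sys/kernel/btf/vmlinux \
		"/boot/vmlinux-$(uname -r)"
	do
		if [ -e "$candidate" ]; then
			echo "$candidate"
			return 0
		fi
	done
	return 1
}

# Example usage, combined with the bpftool command shown earlier:
#   bpftool btf dump file "$(find_vmlinux)" format c > vmlinux.h
```

Running find_vmlinux on its own is a quick way to see which vmlinux a build on your machine would most likely pick up.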

Aside on CO-RE

One of the cooler features of BPF is that it supports CO-RE (Compile Once Run Everywhere). This feature allows you to reference fields inside of structs with types defined internal to the kernel, and not have to recompile if you load the BPF program on a different kernel with the field at a different offset. In our example above, we print out a task name with p->comm. CO-RE would perform relocations for that access when the program is loaded to ensure that it's referencing the correct offset for the currently running kernel.

Compiling the schedulers

Once you have your toolchain set up, and a vmlinux that can be used to generate a full vmlinux.h file, you can compile the schedulers using make:

$ make -j$(nproc)

Example schedulers

This directory contains the following example schedulers. These schedulers are for testing and demonstrating different aspects of sched_ext. While some may be useful in limited scenarios, they are not intended to be practical.

For more scheduler implementations, tools and documentation, visit https://github.com/sched-ext/scx.

scx_simple

A simple scheduler that provides an example of a minimal sched_ext scheduler. scx_simple can be run in either global weighted vtime mode, or FIFO mode.

Though very simple, in limited scenarios, this scheduler can perform reasonably well on single-socket systems with a unified L3 cache.

scx_qmap

A slightly more complex scheduler than scx_simple that provides an example of a basic weighted FIFO queuing policy. It also provides examples of some common useful BPF features, such as sleepable per-task storage allocation in the ops.prep_enable() callback, and using the BPF_MAP_TYPE_QUEUE map type to enqueue tasks. It also illustrates how core-sched support could be implemented.

scx_central

A "central" scheduler where scheduling decisions are made from a single CPU. This scheduler illustrates how scheduling decisions can be dispatched from a single CPU, allowing other cores to run with infinite slices, without timer ticks, and without having to incur the overhead of making scheduling decisions.

The approach demonstrated by this scheduler may be useful for any workload that benefits from minimizing scheduling overhead and timer ticks. An example of where this could be particularly useful is running VMs, where running with infinite slices and no timer ticks allows the VM to avoid unnecessary expensive vmexits.

scx_flatcg

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer, compounding the active weight share at each level. The result is a much more performant CPU controller, which does not need to descend cgroup trees in order to compute a cgroup's share.

Similar to scx_simple, in limited scenarios, this scheduler can perform reasonably well on single-socket systems with a unified L3 cache and show significantly lowered hierarchical scheduling overhead.
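
To make the idea of compounding weight shares concrete, here is a small worked sketch in integer arithmetic. It is not taken from scx_flatcg: the function name flattened_share, the fixed-point scale of 65536, and the example weights are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: illustrates compounding per-level shares, not scx_flatcg's code.

# Fixed-point value standing in for a share of 1.0 (assumed for this example).
HWEIGHT_ONE=65536

# flattened_share WEIGHT:SIBLING_SUM ...: walk from the top-level cgroup down
# to the target. At each level, multiply the running share by
# weight / (sum of the weights of all active siblings, including itself).
# Prints the resulting share scaled by HWEIGHT_ONE.
flattened_share() (
	share=$HWEIGHT_ONE
	for level in "$@"; do
		weight=${level%%:*}
		sum=${level##*:}
		share=$(( share * weight / sum ))
	done
	echo "$share"
)

# Example: cgroup A (weight 100) competes with B (weight 300) at the top
# level, and A's child A1 (weight 200) competes with A2 (weight 200).
#   flattened_share 100:400 200:400    # prints 8192, i.e. 8192/65536 = 1/8
```

Once A1's flattened share (1/8 here) is known, it can be compared directly with the flattened share of any other cgroup, without walking the tree again for each decision.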

Troubleshooting

There are a number of issues that you may run into when building the schedulers. We'll go over some of the common ones here.

Build Failures

Old version of clang

error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
                       ^~~~~~~~~~~~~~~~~~~~
1 error generated.

This means you built the kernel or the schedulers with an older version of clang than what's supported (i.e. older than 16.0.0). To remediate this:

  1. Run which clang and clang --version to make sure you're using a sufficiently new version of clang.

  2. Run make fullclean in the root path of the repository.

  3. Rebuild the kernel, and then your example schedulers.

The schedulers are also cleaned if you invoke make mrproper in the root directory of the tree.

Stale kernel build / incomplete vmlinux.h file

As described above, you'll need a vmlinux.h file that was generated from a vmlinux built with BTF, and with sched_ext support enabled. If you don't, you'll see errors such as the following which indicate that a type being referenced in a scheduler is unknown:

/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'
        const struct scx_exit_info *ei)
                     ^

To resolve this, follow the steps in "Getting a vmlinux.h file" above to ensure your schedulers are using a vmlinux.h file that includes the requisite types.
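
One quick way to tell whether a generated vmlinux.h is the stale one is to look for a full definition of the type named in the error. The sketch below assumes the header uses the usual bpftool layout, where a definition starts a line with "struct NAME {". The helper name has_struct is hypothetical.

```shell
#!/bin/sh
# Sketch only: has_struct is an illustrative helper, not part of this tree.

# has_struct HEADER NAME: succeed if HEADER contains a full definition of
# struct NAME (a line starting with "struct NAME {"), as opposed to only a
# forward declaration such as "struct NAME;".
has_struct() {
	grep -qE "^struct $2 \{" "$1"
}

# Example usage (the path to vmlinux.h depends on your build output directory):
#   has_struct vmlinux.h scx_exit_info ||
#       echo "vmlinux.h lacks sched_ext types; regenerate it from a sched_ext-enabled vmlinux"
```

If the check fails, the vmlinux used to generate the header was most likely built without sched_ext support or without BTF.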

Misc

llvm: [OFF]

You may see the following output when building the schedulers:

Auto-detecting system features:
...                         clang-bpf-co-re: [ on  ]
...                                    llvm: [ OFF ]
...                                  libcap: [ on  ]
...                                  libbfd: [ on  ]

Seeing llvm: [ OFF ] here is not an issue. You can safely ignore it.