linux/tools/sched_ext/scx_simple.bpf.c


sched_ext: Add scx_simple and scx_example_qmap example schedulers Add two simple example BPF schedulers - simple and qmap. * simple: In terms of scheduling, it behaves identical to not having any operation implemented at all. The two operations it implements are only to improve visibility and exit handling. On certain homogeneous configurations, this actually can perform pretty well. * qmap: A fixed five level priority scheduler to demonstrate queueing PIDs on BPF maps for scheduling. While not very practical, this is useful as a simple example and will be used to demonstrate different features. v7: - Compat helpers stripped out in prepartion of upstreaming as the upstreamed patchset will be the baselinfe. Utility macros that can be used to implement compat features are kept. - Explicitly disable map autoattach on struct_ops to avoid trying to attach twice while maintaining compatbility with older libbpf. v6: - Common header files reorganized and cleaned up. Compat helpers are added to demonstrate how schedulers can maintain backward compatibility with older kernels while making use of newly added features. - simple_select_cpu() added to keep track of the number of local dispatches. This is needed because the default ops.select_cpu() implementation is updated to dispatch directly and won't call ops.enqueue(). - Updated to reflect the sched_ext API changes. Switching all tasks is the default behavior now and scx_qmap supports partial switching when `-p` is specified. - tools/sched_ext/Kconfig dropped. This will be included in the doc instead. v5: - Improve Makefile. Build artifects are now collected into a separate dir which change be changed. Install and help targets are added and clean actually cleans everything. - MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR() and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss. - Add scx_common.h which provides common utilities to user code such as SCX_BUG[_ON]() and RESIZE_ARRAY(). 
- Use SCX_BUG[_ON]() to simplify error handling. v4: - Dropped _example prefix from scheduler names. v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit to ease later additions. Comment updates. - Added declarations for BPF inline iterators. In the future, hopefully, these will be consolidated into a generic BPF header so that they don't need to be replicated here. v2: - Updated with the generic BPF cpumask helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-06-18 20:09:17 +00:00
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * A simple scheduler.
 *
sched_ext: Add vtime-ordered priority queue to dispatch_q's Currently, a dsq is always a FIFO. A task which is dispatched earlier gets consumed or executed earlier. While this is sufficient when dsq's are used for simple staging areas for tasks which are ready to execute, it'd make dsq's a lot more useful if they can implement custom ordering. This patch adds a vtime-ordered priority queue to dsq's. When the BPF scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it can specify the vtime tha the task should be inserted at and the task is inserted into the priority queue in the dsq which is ordered according to time_before64() comparison of the vtime values. A DSQ can either be a FIFO or priority queue and automatically switches between the two depending on whether scx_bpf_dispatch() or scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ already has the other type queued is not allowed and triggers an ops error. Built-in DSQs must always be FIFOs. This makes it very easy for the BPF schedulers to implement proper vtime based scheduling within each dsq very easy and efficient at a negligible cost in terms of code complexity and overhead. scx_simple and scx_example_flatcg are updated to default to weighted vtime scheduling (the latter within each cgroup). FIFO scheduling can be selected with -f option. v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes led to unexpected starvations, DSQs now error out if both modes are used at the same time and the built-in DSQs are no longer allowed to be priority queues. - Explicit type struct scx_dsq_node added to contain fields needed to be linked on DSQs. This will be used to implement stateful iterator. - Tasks are now always linked on dsq->list whether the DSQ is in FIFO or PRIQ mode. This confines PRIQ related complexities to the enqueue and dequeue paths. Other paths only need to look at dsq->list. This will also ease implementing BPF iterator. 
- Print p->scx.dsq_flags in debug dump. v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own p->scx.dsq_flags. The flag is protected with the dsq lock unlike other flags in p->scx.flags. This led to flag corruption in some cases. - Add comments explaining the interaction between using consumption of p->scx.slice to determine vtime progress and yielding. v2: - p->scx.dsq_vtime was not initialized on load or across cgroup migrations leading to some tasks being stalled for extended period of time depending on how saturated the machine is. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>
2024-06-18 20:09:21 +00:00
 * By default, it operates as a simple global weighted vtime scheduler and can
 * be switched to FIFO scheduling. It also demonstrates the following niceties.
 *
 * - Statistics tracking how many tasks are queued to local and global dsq's.
 * - Termination notification for userspace.
 *
 * While very simple, this scheduler should work reasonably well on CPUs with a
 * uniform L3 cache topology. While preemption is not implemented, the fact that
 * the scheduling queue is shared across all CPUs means that whatever is at the
 * front of the queue is likely to be executed fairly quickly given enough
 * CPUs. The FIFO scheduling mode may be beneficial to some workloads but comes
 * with the usual problems of FIFO scheduling, where saturating threads can
 * easily drown out interactive ones.
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";
const volatile bool fifo_sched;
static u64 vtime_now;
UEI_DEFINE(uei);
/*
 * Built-in DSQs such as SCX_DSQ_GLOBAL cannot be used as priority queues
 * (meaning, cannot be dispatched to with scx_bpf_dispatch_vtime()). We
 * therefore create a separate DSQ with ID 0 that we dispatch to and consume
 * from. If scx_simple only supported global FIFO scheduling, then we could
 * just use SCX_DSQ_GLOBAL.
 */
#define SHARED_DSQ 0
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(key_size, sizeof(u32));
	__uint(value_size, sizeof(u64));
	__uint(max_entries, 2);			/* [local, global] */
} stats SEC(".maps");

static void stat_inc(u32 idx)
{
	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
	if (cnt_p)
		(*cnt_p)++;
}
static inline bool vtime_before(u64 a, u64 b)
{
	return (s64)(a - b) < 0;
}
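
Illustration only, not part of the scheduler source: the signed-difference test in vtime_before() stays correct when the 64-bit counter wraps, which a plain `a < b` does not. The sketch below restates the same comparison with `<stdint.h>` types so it builds in userspace; the names `vtime_before_demo` and `naive_before` are hypothetical, and the cast assumes the usual two's-complement conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Same comparison as vtime_before(), restated with userspace types.
 * The unsigned subtraction wraps modulo 2^64; reinterpreting the result
 * as signed (two's complement assumed) tells us which value is "earlier"
 * as long as the two are less than 2^63 apart.
 */
static inline bool vtime_before_demo(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Plain unsigned comparison, shown only for contrast. */
static inline bool naive_before(uint64_t a, uint64_t b)
{
	return a < b;
}
```

A value just below the wrap point compares as earlier than a value just past it under `vtime_before_demo()`, while `naive_before()` gets that case backwards.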
s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle) {
		stat_inc(0);	/* count local queueing */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	}

	return cpu;
}

void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	stat_inc(1);	/* count global queueing */

	if (fifo_sched) {
		scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	} else {
		u64 vtime = p->scx.dsq_vtime;

		/*
		 * Limit the amount of budget that an idling task can accumulate
		 * to one slice.
		 */
		if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
			vtime = vtime_now - SCX_SLICE_DFL;

		scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
				       enq_flags);
	}
}

void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	scx_bpf_consume(SHARED_DSQ);
}

void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
{
	if (fifo_sched)
		return;

	/*
	 * Global vtime always progresses forward as tasks start executing. The
	 * test and update can be performed concurrently from multiple CPUs and
	 * thus racy. Any error should be contained and temporary. Let's just
	 * live with it.
	 */
	if (vtime_before(vtime_now, p->scx.dsq_vtime))
		vtime_now = p->scx.dsq_vtime;
}

void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
{
	if (fifo_sched)
		return;

	/*
	 * Scale the execution time by the inverse of the weight and charge.
	 *
	 * Note that the default yield implementation yields by setting
	 * @p->scx.slice to zero and the following would treat the yielding task
	 * as if it has consumed all its slice. If this penalizes yielding tasks
	 * too much, determine the execution time by taking explicit timestamps
	 * instead of depending on @p->scx.slice.
	 */
	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
}

void BPF_STRUCT_OPS(simple_enable, struct task_struct *p)
{
	p->scx.dsq_vtime = vtime_now;
}

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
{
	UEI_RECORD(uei, ei);
}

SCX_OPS_DEFINE(simple_ops,
	       .select_cpu		= (void *)simple_select_cpu,
	       .enqueue			= (void *)simple_enqueue,
	       .dispatch		= (void *)simple_dispatch,
	       .running			= (void *)simple_running,
	       .stopping		= (void *)simple_stopping,
	       .enable			= (void *)simple_enable,
	       .init			= (void *)simple_init,
	       .exit			= (void *)simple_exit,
	       .name			= "simple");
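
The vtime arithmetic used by simple_enqueue() and simple_stopping() can be tried outside the kernel. The following is a minimal userspace sketch, not part of the scheduler. It assumes that vtime_before() is a wraparound-safe signed comparison and that the default slice is 20ms expressed in nanoseconds. SKETCH_SLICE_DFL, sketch_vtime_before(), sketch_clamp_vtime() and sketch_charge() are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Assumption: stands in for SCX_SLICE_DFL, taken here as 20ms in ns. */
#define SKETCH_SLICE_DFL 20000000ULL

/*
 * Assumption: mirrors what vtime_before() is expected to do, i.e. compare
 * two u64 timestamps through their signed difference so that the result
 * stays correct when the counter wraps around.
 */
static int sketch_vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Enqueue side: a task that idled for a long time has a vtime far behind
 * the global clock. Clamp it so that it can bank at most one slice.
 */
static uint64_t sketch_clamp_vtime(uint64_t vtime, uint64_t now)
{
	if (sketch_vtime_before(vtime, now - SKETCH_SLICE_DFL))
		return now - SKETCH_SLICE_DFL;
	return vtime;
}

/*
 * Stopping side: charge the consumed part of the slice, scaled by the
 * inverse of the weight. Weight 100 charges wall time one to one.
 */
static uint64_t sketch_charge(uint64_t vtime, uint64_t slice_left,
			      uint64_t weight)
{
	return vtime + (SKETCH_SLICE_DFL - slice_left) * 100 / weight;
}
```

Under these assumptions, a task with weight 200 that used its whole slice is charged 10ms of vtime, while a task with weight 50 is charged 40ms. The heavier task therefore sorts earlier in the vtime-ordered DSQ and runs roughly four times as often.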