# SPDX-License-Identifier: GPL-2.0-only

# BPF interpreter that, for example, classic socket filters depend on.
config BPF
	bool

# Used by archs to tell that they support BPF JIT compiler plus which
# flavour. Only one of the two can be selected for a specific arch since
# eBPF JIT supersedes the cBPF JIT.

# Classic BPF JIT (cBPF)
config HAVE_CBPF_JIT
	bool

# Extended BPF JIT (eBPF)
config HAVE_EBPF_JIT
	bool

# Used by archs to tell that they want the BPF JIT compiler enabled by
# default for kernels that were compiled with BPF JIT support.
config ARCH_WANT_DEFAULT_BPF_JIT
	bool

menu "BPF subsystem"

config BPF_SYSCALL
	bool "Enable bpf() system call"
	select BPF
	select IRQ_WORK
	select TASKS_RCU if PREEMPTION
	select TASKS_TRACE_RCU
	select BINARY_PRINTF
	select NET_SOCK_MSG if NET
	select NET_XGRESS if NET
	select PAGE_POOL if NET
	default n
	help
	  Enable the bpf() system call, which allows manipulating BPF programs
	  and maps via file descriptors.

config BPF_JIT
	bool "Enable BPF Just In Time compiler"
	depends on BPF
	depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
	depends on MODULES
	help
	  BPF programs are normally handled by a BPF interpreter. This option
	  allows the kernel to generate native code when a program is loaded
	  into the kernel. This will significantly speed up processing of BPF
	  programs.

	  Note that an admin should enable this feature by changing:
	  /proc/sys/net/core/bpf_jit_enable
	  /proc/sys/net/core/bpf_jit_harden (optional)
	  /proc/sys/net/core/bpf_jit_kallsyms (optional)
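
# Illustrative sketch, not part of the upstream file: assuming a kernel built
# with CONFIG_BPF_JIT=y and a root shell, the knobs listed in the help text
# above could be set via sysctl, for example:
#
#   sysctl -w net.core.bpf_jit_enable=1
#   sysctl -w net.core.bpf_jit_harden=1     # optional
#   sysctl -w net.core.bpf_jit_kallsyms=1   # optional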

config BPF_JIT_ALWAYS_ON
	bool "Permanently enable BPF JIT and remove BPF interpreter"
	depends on BPF_SYSCALL && HAVE_EBPF_JIT && BPF_JIT
	help
	  Enables the BPF JIT and removes the BPF interpreter to avoid
	  speculative execution of BPF instructions by the interpreter.

	  When CONFIG_BPF_JIT_ALWAYS_ON is enabled, /proc/sys/net/core/bpf_jit_enable
	  is permanently set to 1, and any attempt to set another value
	  returns failure.

config BPF_JIT_DEFAULT_ON
	def_bool ARCH_WANT_DEFAULT_BPF_JIT || BPF_JIT_ALWAYS_ON
	depends on HAVE_EBPF_JIT && BPF_JIT

config BPF_UNPRIV_DEFAULT_OFF
	bool "Disable unprivileged BPF by default"
	default y
	depends on BPF_SYSCALL
	help
	  Disables unprivileged BPF by default by setting the corresponding
	  /proc/sys/kernel/unprivileged_bpf_disabled knob to 2. An admin can
	  still re-enable it by setting it to 0 later on, or permanently
	  disable it by setting it to 1 (from which no transition back to
	  0 is possible anymore).

	  Unprivileged BPF could be used to exploit certain potential
	  speculative execution side-channel vulnerabilities on unmitigated
	  affected hardware.

	  If you are unsure how to answer this question, answer Y.
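
# Illustrative sketch, not part of the upstream file: assuming a root shell,
# the knob described in the help text above could be inspected and changed at
# runtime, for example:
#
#   sysctl kernel.unprivileged_bpf_disabled        # 2 by default with this option set
#   sysctl -w kernel.unprivileged_bpf_disabled=0   # re-enable unprivileged BPF
#   sysctl -w kernel.unprivileged_bpf_disabled=1   # disable; no way back to 0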

source "kernel/bpf/preload/Kconfig"

config BPF_LSM
	bool "Enable BPF LSM Instrumentation"
	depends on BPF_EVENTS
	depends on BPF_SYSCALL
	depends on SECURITY
	depends on BPF_JIT
	help
	  Enables instrumentation of the security hooks with BPF programs for
	  implementing dynamic MAC and Audit Policies.

	  If you are unsure how to answer this question, answer N.

endmenu # "BPF subsystem"