bpf: Add generic attach/detach/query API for multi-progs
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of bpf_mprog don't need to care about the dependency
resolution internals; they can just consume it with a few API calls.
The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:
  I cannot help but feel that priority logic copy-paste from old tc, netfilter
  and friends is done because "that's how things were done in the past". [...]
  Priority gets exposed everywhere in uapi all the way to bpftool when it's
  right there for users to understand. And that's the main problem with it.
  The user don't want to and don't need to be aware of it, but uapi forces them
  to pick the priority. [...] Your cover letter [0] example proves that in
  real life different service pick the same priority. They simply don't know
  any better. Priority is an unnecessary magic that apps _have_ to pick, so
  they just copy-paste and everyone ends up using the same.
The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.
Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:
  - Support prog-based attach/detach and link API
  - Dependency directives (can also be combined):
    - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
      - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
        space application does not need CAP_SYS_ADMIN to retrieve foreign fds
        via bpf_*_get_fd_by_id()
      - BPF_F_LINK flag as {prog,link} toggle
      - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
        BPF_F_AFTER will just append for attaching
      - Enforced only at attach time
    - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
      own infra for replacing their internal prog
    - If no flags are set, then it's default append behavior for attaching
  - Internal revision counter and optionally being able to pass expected_revision
    - User space application can query current state with revision, and pass it
      along for attachment to assert current state before doing updates
  - Query also gets extension for link_ids array and link_attach_flags:
    - prog_ids are always filled with program IDs
    - link_ids are filled with link IDs when link was used, otherwise 0
    - {prog,link}_attach_flags for holding {prog,link}-specific flags
  - Must be easy to integrate/reuse for in-kernel users
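The positioning rules in the list above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the SKETCH_F_* flags and sketch_insert_index() are hypothetical names made up for illustration, programs are identified by plain integer IDs, and only one directive at a time is modeled (combining directives is left out).

```c
#include <assert.h>

/* Hypothetical flag names for this sketch; these are not the uapi flags. */
enum {
	SKETCH_F_BEFORE = 1 << 0,
	SKETCH_F_AFTER  = 1 << 1,
};

/*
 * Return the array index at which a new program would be inserted.
 * @ids:      IDs of the attached programs, in execution order
 * @total:    number of valid entries in @ids
 * @flags:    SKETCH_F_BEFORE, SKETCH_F_AFTER, or 0
 * @relative: ID of the program to position against, or 0 for "none"
 *
 * Returns -1 if @relative is not attached, if @relative is given without
 * a directive, or if both directives are set (not modeled here).
 */
static int sketch_insert_index(const unsigned int *ids, int total,
			       unsigned int flags, unsigned int relative)
{
	int i;

	assert(total >= 0);
	if ((flags & SKETCH_F_BEFORE) && (flags & SKETCH_F_AFTER))
		return -1;

	if (!relative) {
		/* No relative object: BEFORE prepends, everything else appends. */
		return (flags & SKETCH_F_BEFORE) ? 0 : total;
	}

	for (i = 0; i < total; i++) {
		if (ids[i] == relative)
			break;
	}
	if (i == total)
		return -1;

	if (flags & SKETCH_F_BEFORE)
		return i;
	if (flags & SKETCH_F_AFTER)
		return i + 1;
	return -1;
}
```

With programs 10, 20, 30 attached, inserting before 20 yields index 1, inserting after 20 yields index 2, and passing no flags yields index 3 (append).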
The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.
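The revision counter mentioned above works as an optimistic concurrency check: query the revision, then pass it back on update so that the update fails if anything changed in between. Below is a minimal user-space sketch of that idea. The struct, function names, the convention that 0 means "no expectation", and the use of -ESTALE as the error are assumptions of this sketch and not taken from the patch text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an attach location with a revision counter. */
struct sketch_state {
	uint64_t revision;	/* bumped on every successful update */
	int nprogs;		/* number of attached programs */
};

static void sketch_init(struct sketch_state *s)
{
	assert(s != NULL);
	s->revision = 1;
	s->nprogs = 0;
}

/*
 * Attach one program. If @expected_revision is non-zero it must match the
 * current revision, otherwise the update is rejected and nothing changes.
 */
static int sketch_attach(struct sketch_state *s, uint64_t expected_revision)
{
	assert(s != NULL);
	if (expected_revision && expected_revision != s->revision)
		return -ESTALE;

	s->nprogs++;
	s->revision++;
	return 0;
}
```

A caller that remembers the revision from its last query and passes it along succeeds once; a second attempt with the same stale value is rejected because the first update already bumped the counter.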
The bpf_mprog framework consists of a bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.
The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
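The a/b pair design described above can be sketched in user space as follows. The update is built in the inactive peer while the active entry stays untouched, and the caller then swaps the pointer. All sketch_* names, the tiny array size, and the use of int slots with 0 as the empty marker are simplifications assumed for this illustration; RCU and locking are not modeled.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX 4

struct sketch_bundle;

struct sketch_entry {
	int items[SKETCH_MAX];		/* 0 marks an empty slot */
	struct sketch_bundle *parent;
};

/* Both entries are embedded, so an update never needs to allocate. */
struct sketch_bundle {
	struct sketch_entry a;
	struct sketch_entry b;
};

static void sketch_bundle_init(struct sketch_bundle *bundle)
{
	memset(bundle, 0, sizeof(*bundle));
	bundle->a.parent = bundle;
	bundle->b.parent = bundle;
}

static struct sketch_entry *sketch_peer(struct sketch_entry *entry)
{
	if (entry == &entry->parent->a)
		return &entry->parent->b;
	return &entry->parent->a;
}

/*
 * Remove slot @idx. The result is written to the peer of @active and the
 * peer is returned; @active is left unmodified for in-flight readers.
 */
static struct sketch_entry *sketch_remove(struct sketch_entry *active, int idx)
{
	struct sketch_entry *peer = sketch_peer(active);

	assert(idx >= 0 && idx < SKETCH_MAX);
	memcpy(peer->items, active->items, sizeof(peer->items));
	memmove(peer->items + idx, peer->items + idx + 1,
		(SKETCH_MAX - idx - 1) * sizeof(peer->items[0]));
	peer->items[SKETCH_MAX - 1] = 0;
	return peer;
}
```

Since the peer already exists inside the bundle, the removal path has no allocation that could fail, which is the property the paragraph above highlights for the detach case.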
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.
An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.
Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.
[0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
[1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
[2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 16:08:51 +02:00
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2023 Isovalent */
#ifndef __BPF_MPROG_H
#define __BPF_MPROG_H

#include <linux/bpf.h>

/* bpf_mprog framework:
 *
 * bpf_mprog is a generic layer for multi-program attachment. In-kernel users
 * of bpf_mprog don't need to care about the dependency resolution
 * internals; they can just consume it with a few API calls. Currently available
 * dependency directives are BPF_F_{BEFORE,AFTER} which enable insertion of
 * a BPF program or BPF link relative to an existing BPF program or BPF link
 * inside the multi-program array as well as prepend and append behavior if
 * no relative object was specified, see corresponding selftests for concrete
 * examples (e.g. tc_links and tc_opts test cases of test_progs).
 *
 * Usage of bpf_mprog_{attach,detach,query}() core APIs with pseudo code:
 *
 * Attach case:
 *
 *   struct bpf_mprog_entry *entry, *entry_new;
 *   int ret;
 *
 *   // bpf_mprog user-side lock
 *   // fetch active @entry from attach location
 *   [...]
 *   ret = bpf_mprog_attach(entry, &entry_new, [...]);
 *   if (!ret) {
 *       if (entry != entry_new) {
 *           // swap @entry to @entry_new at attach location
 *           // ensure there are no inflight users of @entry:
 *           synchronize_rcu();
 *       }
 *       bpf_mprog_commit(entry);
 *   } else {
 *       // error path, bail out, propagate @ret
 *   }
 *   // bpf_mprog user-side unlock
 *
 * Detach case:
 *
 *   struct bpf_mprog_entry *entry, *entry_new;
 *   int ret;
 *
 *   // bpf_mprog user-side lock
 *   // fetch active @entry from attach location
 *   [...]
 *   ret = bpf_mprog_detach(entry, &entry_new, [...]);
 *   if (!ret) {
 *       // all (*) marked is optional and depends on the use-case
 *       // whether bpf_mprog_bundle should be freed or not
 *       if (!bpf_mprog_total(entry_new))     (*)
 *           entry_new = NULL                 (*)
 *       // swap @entry to @entry_new at attach location
 *       // ensure there are no inflight users of @entry:
 *       synchronize_rcu();
 *       bpf_mprog_commit(entry);
 *       if (!entry_new)                      (*)
 *           // free bpf_mprog_bundle         (*)
 *   } else {
 *       // error path, bail out, propagate @ret
 *   }
 *   // bpf_mprog user-side unlock
 *
 * Query case:
 *
 *   struct bpf_mprog_entry *entry;
 *   int ret;
 *
 *   // bpf_mprog user-side lock
 *   // fetch active @entry from attach location
 *   [...]
 *   ret = bpf_mprog_query(attr, uattr, entry);
 *   // bpf_mprog user-side unlock
 *
 * Data/fast path:
 *
 *   struct bpf_mprog_entry *entry;
 *   struct bpf_mprog_fp *fp;
 *   struct bpf_prog *prog;
 *   int ret = [...];
 *
 *   rcu_read_lock();
 *   // fetch active @entry from attach location
 *   [...]
 *   bpf_mprog_foreach_prog(entry, fp, prog) {
 *       ret = bpf_prog_run(prog, [...]);
 *       // process @ret from program
 *   }
 *   [...]
 *   rcu_read_unlock();
 *
 * bpf_mprog locking considerations:
 *
 * bpf_mprog_{attach,detach,query}() must be protected by an external lock
 * (like RTNL in case of tcx).
 *
 * bpf_mprog_entry pointer can be an __rcu annotated pointer (in case of tcx
 * the netdevice has tcx_ingress and tcx_egress __rcu pointer) which gets
 * updated via rcu_assign_pointer() pointing to the active bpf_mprog_entry of
 * the bpf_mprog_bundle.
 *
 * Fast path accesses the active bpf_mprog_entry within RCU critical section
 * (in case of tcx it runs in NAPI which provides RCU protection there,
 * other users might need explicit rcu_read_lock()). The bpf_mprog_commit()
 * assumes that for the old bpf_mprog_entry there are no inflight users
 * anymore.
 *
 * The READ_ONCE()/WRITE_ONCE() pairing for bpf_mprog_fp's prog access is for
 * the replacement case where we don't swap the bpf_mprog_entry.
 */

#define bpf_mprog_foreach_tuple(entry, fp, cp, t)			\
	for (fp = &entry->fp_items[0], cp = &entry->parent->cp_items[0];\
	     ({								\
		t.prog = READ_ONCE(fp->prog);				\
		t.link = cp->link;					\
		t.prog;							\
	      });							\
	     fp++, cp++)

#define bpf_mprog_foreach_prog(entry, fp, p)				\
	for (fp = &entry->fp_items[0];					\
	     (p = READ_ONCE(fp->prog));					\
	     fp++)

#define BPF_MPROG_MAX 64

struct bpf_mprog_fp {
	struct bpf_prog *prog;
};

struct bpf_mprog_cp {
	struct bpf_link *link;
};

struct bpf_mprog_entry {
	struct bpf_mprog_fp fp_items[BPF_MPROG_MAX];
	struct bpf_mprog_bundle *parent;
};

struct bpf_mprog_bundle {
	struct bpf_mprog_entry a;
	struct bpf_mprog_entry b;
	struct bpf_mprog_cp cp_items[BPF_MPROG_MAX];
	struct bpf_prog *ref;
	atomic64_t revision;
	u32 count;
};

struct bpf_tuple {
	struct bpf_prog *prog;
	struct bpf_link *link;
};

static inline struct bpf_mprog_entry *
bpf_mprog_peer(const struct bpf_mprog_entry *entry)
{
	if (entry == &entry->parent->a)
		return &entry->parent->b;
	else
		return &entry->parent->a;
}

static inline void bpf_mprog_bundle_init(struct bpf_mprog_bundle *bundle)
{
	BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
	BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) !=
		     ARRAY_SIZE(bundle->cp_items));

	memset(bundle, 0, sizeof(*bundle));
	atomic64_set(&bundle->revision, 1);
	bundle->a.parent = bundle;
	bundle->b.parent = bundle;
}

static inline void bpf_mprog_inc(struct bpf_mprog_entry *entry)
{
	entry->parent->count++;
}

static inline void bpf_mprog_dec(struct bpf_mprog_entry *entry)
{
	entry->parent->count--;
}

static inline int bpf_mprog_max(void)
{
	return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
}

static inline int bpf_mprog_total(struct bpf_mprog_entry *entry)
{
	int total = entry->parent->count;

	WARN_ON_ONCE(total > bpf_mprog_max());
	return total;
}

static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
				    struct bpf_prog *prog)
{
	const struct bpf_mprog_fp *fp;
	const struct bpf_prog *tmp;

	bpf_mprog_foreach_prog(entry, fp, tmp) {
		if (tmp == prog)
			return true;
	}
	return false;
}

static inline void bpf_mprog_mark_for_release(struct bpf_mprog_entry *entry,
					      struct bpf_tuple *tuple)
{
	WARN_ON_ONCE(entry->parent->ref);
	if (!tuple->link)
		entry->parent->ref = tuple->prog;
}

static inline void bpf_mprog_complete_release(struct bpf_mprog_entry *entry)
{
	/* In the non-link case prog deletions can only drop the reference
	 * to the prog after the bpf_mprog_entry got swapped and the
	 * bpf_mprog ensured that there are no inflight users anymore.
	 *
	 * Paired with bpf_mprog_mark_for_release().
	 */
	if (entry->parent->ref) {
		bpf_prog_put(entry->parent->ref);
		entry->parent->ref = NULL;
	}
}

static inline void bpf_mprog_revision_new(struct bpf_mprog_entry *entry)
{
	atomic64_inc(&entry->parent->revision);
}

static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
{
	bpf_mprog_complete_release(entry);
	bpf_mprog_revision_new(entry);
}

static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
{
	return atomic64_read(&entry->parent->revision);
}

static inline void bpf_mprog_entry_copy(struct bpf_mprog_entry *dst,
					struct bpf_mprog_entry *src)
{
	memcpy(dst->fp_items, src->fp_items, sizeof(src->fp_items));
}

2023-07-28 23:47:17 +02:00
static inline void bpf_mprog_entry_clear(struct bpf_mprog_entry *dst)
{
	memset(dst->fp_items, 0, sizeof(dst->fp_items));
}

static inline void bpf_mprog_clear_all(struct bpf_mprog_entry *entry,
				       struct bpf_mprog_entry **entry_new)
{
	struct bpf_mprog_entry *peer;

	peer = bpf_mprog_peer(entry);
	bpf_mprog_entry_clear(peer);
	peer->parent->count = 0;
	*entry_new = peer;
}

static inline void bpf_mprog_entry_grow(struct bpf_mprog_entry *entry, int idx)
{
	int total = bpf_mprog_total(entry);

	memmove(entry->fp_items + idx + 1,
		entry->fp_items + idx,
		(total - idx) * sizeof(struct bpf_mprog_fp));

	memmove(entry->parent->cp_items + idx + 1,
		entry->parent->cp_items + idx,
		(total - idx) * sizeof(struct bpf_mprog_cp));
}

static inline void bpf_mprog_entry_shrink(struct bpf_mprog_entry *entry, int idx)
{
	/* Total array size is needed in this case to ensure the NULL
	 * entry is copied at the end.
	 */
	int total = ARRAY_SIZE(entry->fp_items);

	memmove(entry->fp_items + idx,
		entry->fp_items + idx + 1,
		(total - idx - 1) * sizeof(struct bpf_mprog_fp));

	memmove(entry->parent->cp_items + idx,
		entry->parent->cp_items + idx + 1,
		(total - idx - 1) * sizeof(struct bpf_mprog_cp));
}

static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 idx,
				  struct bpf_mprog_fp **fp,
				  struct bpf_mprog_cp **cp)
{
	*fp = &entry->fp_items[idx];
	*cp = &entry->parent->cp_items[idx];
}

static inline void bpf_mprog_write(struct bpf_mprog_fp *fp,
				   struct bpf_mprog_cp *cp,
				   struct bpf_tuple *tuple)
{
	WRITE_ONCE(fp->prog, tuple->prog);
	cp->link = tuple->link;
}

int bpf_mprog_attach(struct bpf_mprog_entry *entry,
		     struct bpf_mprog_entry **entry_new,
		     struct bpf_prog *prog_new, struct bpf_link *link,
		     struct bpf_prog *prog_old,
		     u32 flags, u32 id_or_fd, u64 revision);

int bpf_mprog_detach(struct bpf_mprog_entry *entry,
		     struct bpf_mprog_entry **entry_new,
		     struct bpf_prog *prog, struct bpf_link *link,
		     u32 flags, u32 id_or_fd, u64 revision);

int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
		    struct bpf_mprog_entry *entry);

bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each other's toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]
  - From Cilium-side the tc BPF programs we attach to host-facing veth devices
    and phys devices build the core datapath for Kubernetes Pods, and they
    implement forwarding, load-balancing, policy, EDT-management, etc, within
    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
    experienced hard-to-debug issues in a user's staging environment where
    another Kubernetes application using tc BPF attached to the same prio/handle
    of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
    it. The goal is to establish a clear/safe ownership model via links which
    cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given it's all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
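The distinction between the non-terminating and terminating return codes can be sketched with a small user-space run loop. This is an illustration only: the sketch_* names and the enum values are hypothetical and not the uapi definitions, programs are modeled as plain function pointers in a NULL-terminated array, and what the caller does when every program returns "next" is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts for this sketch; values are not the uapi ones. */
enum sketch_verdict {
	SKETCH_NEXT,		/* non-terminating: go to next program */
	SKETCH_PASS,		/* terminating */
	SKETCH_DROP,		/* terminating */
	SKETCH_REDIRECT,	/* terminating */
};

typedef enum sketch_verdict (*sketch_prog_fn)(void *ctx);

/* Sample program: counts its invocation and defers to the next program. */
static enum sketch_verdict sketch_prog_next(void *ctx)
{
	(*(int *)ctx)++;
	return SKETCH_NEXT;
}

/* Sample program: counts its invocation and terminates with a drop. */
static enum sketch_verdict sketch_prog_drop(void *ctx)
{
	(*(int *)ctx)++;
	return SKETCH_DROP;
}

/*
 * Run a NULL-terminated array of programs in order and stop at the first
 * terminating verdict. If all programs return SKETCH_NEXT, that value is
 * returned so the caller can decide how to continue.
 */
static enum sketch_verdict sketch_run(const sketch_prog_fn *progs, void *ctx)
{
	const sketch_prog_fn *fp;
	enum sketch_verdict ret = SKETCH_NEXT;

	assert(progs != NULL);
	for (fp = progs; *fp; fp++) {
		ret = (*fp)(ctx);
		if (ret != SKETCH_NEXT)
			break;
	}
	return ret;
}
```

For a chain of next, drop, next, the loop stops after the second program and the third one is never invoked.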
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 16:08:52 +02:00
static inline bool bpf_mprog_supported(enum bpf_prog_type type)
{
	switch (type) {
	case BPF_PROG_TYPE_SCHED_CLS:
		return true;
	default:
		return false;
	}
}

#endif /* __BPF_MPROG_H */