2019-06-04 10:11:33 +02:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2015-10-29 14:58:09 +01:00
|
|
|
/*
|
|
|
|
* Minimal file system backend for holding eBPF maps and programs,
|
|
|
|
* used by bpf(2) object pinning.
|
|
|
|
*
|
|
|
|
* Authors:
|
|
|
|
*
|
|
|
|
* Daniel Borkmann <daniel@iogearbox.net>
|
|
|
|
*/
|
|
|
|
|
2016-07-11 12:51:01 -04:00
|
|
|
#include <linux/init.h>
|
2015-10-29 14:58:09 +01:00
|
|
|
#include <linux/magic.h>
|
|
|
|
#include <linux/major.h>
|
|
|
|
#include <linux/mount.h>
|
|
|
|
#include <linux/namei.h>
|
|
|
|
#include <linux/fs.h>
|
2019-03-22 14:58:36 +00:00
|
|
|
#include <linux/fs_context.h>
|
|
|
|
#include <linux/fs_parser.h>
|
2015-10-29 14:58:09 +01:00
|
|
|
#include <linux/kdev_t.h>
|
|
|
|
#include <linux/filter.h>
|
|
|
|
#include <linux/bpf.h>
|
bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.
For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.
Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:
# ./perf record -a -e bpf:* sleep 10
# ./perf script
sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
[...]
sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
[1] https://lwn.net/Articles/705270/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 02:28:18 +01:00
|
|
|
#include <linux/bpf_trace.h>
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
#include <linux/kstrtox.h>
|
2020-08-18 21:27:58 -07:00
|
|
|
#include "preload/bpf_preload.h"
|
2015-10-29 14:58:09 +01:00
|
|
|
|
|
|
|
enum bpf_type {
|
|
|
|
BPF_TYPE_UNSPEC = 0,
|
|
|
|
BPF_TYPE_PROG,
|
|
|
|
BPF_TYPE_MAP,
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
BPF_TYPE_LINK,
|
2015-10-29 14:58:09 +01:00
|
|
|
};
|
|
|
|
|
|
|
|
static void *bpf_any_get(void *raw, enum bpf_type type)
|
|
|
|
{
|
|
|
|
switch (type) {
|
|
|
|
case BPF_TYPE_PROG:
|
2019-11-17 09:28:03 -08:00
|
|
|
bpf_prog_inc(raw);
|
2015-10-29 14:58:09 +01:00
|
|
|
break;
|
|
|
|
case BPF_TYPE_MAP:
|
bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails
92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
(32k). Due to using 32-bit counter, it's possible in practice to overflow
refcounter and make it wrap around to 0, causing erroneous map free, while
there are still references to it, causing use-after-free problems.
But having a failing refcounting operations are problematic in some cases. One
example is mmap() interface. After establishing initial memory-mapping, user
is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
splitting it into multiple non-contiguous regions. All this happening without
any control from the users of mmap subsystem. Rather mmap subsystem sends
notifications to original creator of memory mapping through open/close
callbacks, which are optionally specified during initial memory mapping
creation. These callbacks are used to maintain accurate refcount for bpf_map
(see next patch in this series). The problem is that open() callback is not
supposed to fail, because memory-mapped resource is set up and properly
referenced. This is posing a problem for using memory-mapping with BPF maps.
One solution to this is to maintain separate refcount for just memory-mappings
and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
There are similar use cases in current work on tcp-bpf, necessitating extra
counter as well. This seems like a rather unfortunate and ugly solution that
doesn't scale well to various new use cases.
Another approach to solve this is to use non-failing refcount_t type, which
uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
stays there. This utlimately causes memory leak, but prevents use after free.
But given refcounting is not the most performance-critical operation with BPF
maps (it's not used from running BPF program code), we can also just switch to
64-bit counter that can't overflow in practice, potentially disadvantaging
32-bit platforms a tiny bit. This simplifies semantics and allows above
described scenarios to not worry about failing refcount increment operation.
In terms of struct bpf_map size, we are still good and use the same amount of
space:
BEFORE (3 cache lines, 8 bytes of padding at the end):
struct bpf_map {
const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
struct bpf_map * inner_map_meta; /* 8 8 */
void * security; /* 16 8 */
enum bpf_map_type map_type; /* 24 4 */
u32 key_size; /* 28 4 */
u32 value_size; /* 32 4 */
u32 max_entries; /* 36 4 */
u32 map_flags; /* 40 4 */
int spin_lock_off; /* 44 4 */
u32 id; /* 48 4 */
int numa_node; /* 52 4 */
u32 btf_key_type_id; /* 56 4 */
u32 btf_value_type_id; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct btf * btf; /* 64 8 */
struct bpf_map_memory memory; /* 72 16 */
bool unpriv_array; /* 88 1 */
bool frozen; /* 89 1 */
/* XXX 38 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
atomic_t refcnt __attribute__((__aligned__(64))); /* 128 4 */
atomic_t usercnt; /* 132 4 */
struct work_struct work; /* 136 32 */
char name[16]; /* 168 16 */
/* size: 192, cachelines: 3, members: 21 */
/* sum members: 146, holes: 1, sum holes: 38 */
/* padding: 8 */
/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));
AFTER (same 3 cache lines, no extra padding now):
struct bpf_map {
const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
struct bpf_map * inner_map_meta; /* 8 8 */
void * security; /* 16 8 */
enum bpf_map_type map_type; /* 24 4 */
u32 key_size; /* 28 4 */
u32 value_size; /* 32 4 */
u32 max_entries; /* 36 4 */
u32 map_flags; /* 40 4 */
int spin_lock_off; /* 44 4 */
u32 id; /* 48 4 */
int numa_node; /* 52 4 */
u32 btf_key_type_id; /* 56 4 */
u32 btf_value_type_id; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct btf * btf; /* 64 8 */
struct bpf_map_memory memory; /* 72 16 */
bool unpriv_array; /* 88 1 */
bool frozen; /* 89 1 */
/* XXX 38 bytes hole, try to pack */
/* --- cacheline 2 boundary (128 bytes) --- */
atomic64_t refcnt __attribute__((__aligned__(64))); /* 128 8 */
atomic64_t usercnt; /* 136 8 */
struct work_struct work; /* 144 32 */
char name[16]; /* 176 16 */
/* size: 192, cachelines: 3, members: 21 */
/* sum members: 154, holes: 1, sum holes: 38 */
/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));
This patch, while modifying all users of bpf_map_inc, also cleans up its
interface to match bpf_map_put with separate operations for bpf_map_inc and
bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
respectively). Also, given there are no users of bpf_map_inc_not_zero
specifying uref=true, remove uref flag and default to uref=false internally.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
2019-11-17 09:28:02 -08:00
|
|
|
bpf_map_inc_with_uref(raw);
|
2015-10-29 14:58:09 +01:00
|
|
|
break;
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
case BPF_TYPE_LINK:
|
|
|
|
bpf_link_inc(raw);
|
|
|
|
break;
|
2015-10-29 14:58:09 +01:00
|
|
|
default:
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return raw;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_any_put(void *raw, enum bpf_type type)
|
|
|
|
{
|
|
|
|
switch (type) {
|
|
|
|
case BPF_TYPE_PROG:
|
|
|
|
bpf_prog_put(raw);
|
|
|
|
break;
|
|
|
|
case BPF_TYPE_MAP:
|
bpf: fix clearing on persistent program array maps
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.
Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.
Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.
Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.
Joint work with Alexei.
Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24 21:28:15 +01:00
|
|
|
bpf_map_put_with_uref(raw);
|
2015-10-29 14:58:09 +01:00
|
|
|
break;
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
case BPF_TYPE_LINK:
|
|
|
|
bpf_link_put(raw);
|
|
|
|
break;
|
2015-10-29 14:58:09 +01:00
|
|
|
default:
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
|
|
|
|
{
|
|
|
|
void *raw;
|
|
|
|
|
bpf: fix clearing on persistent program array maps
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.
Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.
Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.
Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.
Joint work with Alexei.
Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24 21:28:15 +01:00
|
|
|
raw = bpf_map_get_with_uref(ufd);
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
if (!IS_ERR(raw)) {
|
|
|
|
*type = BPF_TYPE_MAP;
|
|
|
|
return raw;
|
|
|
|
}
|
|
|
|
|
|
|
|
raw = bpf_prog_get(ufd);
|
|
|
|
if (!IS_ERR(raw)) {
|
2015-10-29 14:58:09 +01:00
|
|
|
*type = BPF_TYPE_PROG;
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
return raw;
|
2015-10-29 14:58:09 +01:00
|
|
|
}
|
|
|
|
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
raw = bpf_link_get_from_fd(ufd);
|
|
|
|
if (!IS_ERR(raw)) {
|
|
|
|
*type = BPF_TYPE_LINK;
|
|
|
|
return raw;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ERR_PTR(-EINVAL);
|
2015-10-29 14:58:09 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
static const struct inode_operations bpf_dir_iops;
|
|
|
|
|
|
|
|
static const struct inode_operations bpf_prog_iops = { };
|
|
|
|
static const struct inode_operations bpf_map_iops = { };
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
static const struct inode_operations bpf_link_iops = { };
|
2015-10-29 14:58:09 +01:00
|
|
|
|
bpf: Introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.
This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).
BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.
When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.
Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).
Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
2024-01-23 18:21:00 -08:00
|
|
|
struct inode *bpf_get_inode(struct super_block *sb,
|
|
|
|
const struct inode *dir,
|
|
|
|
umode_t mode)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
switch (mode & S_IFMT) {
|
|
|
|
case S_IFDIR:
|
|
|
|
case S_IFREG:
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
case S_IFLNK:
|
2015-10-29 14:58:09 +01:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
inode = new_inode(sb);
|
|
|
|
if (!inode)
|
|
|
|
return ERR_PTR(-ENOSPC);
|
|
|
|
|
|
|
|
inode->i_ino = get_next_ino();
|
2023-10-04 14:53:06 -04:00
|
|
|
simple_inode_init_ts(inode);
|
2015-10-29 14:58:09 +01:00
|
|
|
|
2023-01-13 12:49:25 +01:00
|
|
|
inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
|
2015-10-29 14:58:09 +01:00
|
|
|
|
|
|
|
return inode;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
|
|
|
|
{
|
|
|
|
*type = BPF_TYPE_UNSPEC;
|
|
|
|
if (inode->i_op == &bpf_prog_iops)
|
|
|
|
*type = BPF_TYPE_PROG;
|
|
|
|
else if (inode->i_op == &bpf_map_iops)
|
|
|
|
*type = BPF_TYPE_MAP;
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
else if (inode->i_op == &bpf_link_iops)
|
|
|
|
*type = BPF_TYPE_LINK;
|
2015-10-29 14:58:09 +01:00
|
|
|
else
|
|
|
|
return -EACCES;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
|
|
|
|
struct inode *dir)
|
|
|
|
{
|
|
|
|
d_instantiate(dentry, inode);
|
|
|
|
dget(dentry);
|
|
|
|
|
2023-10-04 14:53:06 -04:00
|
|
|
inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
}
|
|
|
|
|
2023-01-13 12:49:15 +01:00
|
|
|
static int bpf_mkdir(struct mnt_idmap *idmap, struct inode *dir,
|
2021-01-21 14:19:43 +01:00
|
|
|
struct dentry *dentry, umode_t mode)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
inode = bpf_get_inode(dir->i_sb, dir, mode | S_IFDIR);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return PTR_ERR(inode);
|
|
|
|
|
|
|
|
inode->i_op = &bpf_dir_iops;
|
|
|
|
inode->i_fop = &simple_dir_operations;
|
|
|
|
|
|
|
|
inc_nlink(inode);
|
|
|
|
inc_nlink(dir);
|
|
|
|
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
bpf_dentry_finalize(dentry, inode, dir);
|
2015-10-29 14:58:09 +01:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-04-18 15:56:03 -07:00
|
|
|
struct map_iter {
|
|
|
|
void *key;
|
|
|
|
bool done;
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct map_iter *map_iter(struct seq_file *m)
|
|
|
|
{
|
|
|
|
return m->private;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct bpf_map *seq_file_to_map(struct seq_file *m)
|
|
|
|
{
|
|
|
|
return file_inode(m->file)->i_private;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void map_iter_free(struct map_iter *iter)
|
|
|
|
{
|
|
|
|
if (iter) {
|
|
|
|
kfree(iter->key);
|
|
|
|
kfree(iter);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct map_iter *map_iter_alloc(struct bpf_map *map)
|
|
|
|
{
|
|
|
|
struct map_iter *iter;
|
|
|
|
|
|
|
|
iter = kzalloc(sizeof(*iter), GFP_KERNEL | __GFP_NOWARN);
|
|
|
|
if (!iter)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
iter->key = kzalloc(map->key_size, GFP_KERNEL | __GFP_NOWARN);
|
|
|
|
if (!iter->key)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
return iter;
|
|
|
|
|
|
|
|
error:
|
|
|
|
map_iter_free(iter);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *map_seq_next(struct seq_file *m, void *v, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct bpf_map *map = seq_file_to_map(m);
|
|
|
|
void *key = map_iter(m)->key;
|
2018-08-09 08:55:19 -07:00
|
|
|
void *prev_key;
|
2018-04-18 15:56:03 -07:00
|
|
|
|
2020-01-25 12:10:02 +03:00
|
|
|
(*pos)++;
|
2018-04-18 15:56:03 -07:00
|
|
|
if (map_iter(m)->done)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
if (unlikely(v == SEQ_START_TOKEN))
|
2018-08-09 08:55:19 -07:00
|
|
|
prev_key = NULL;
|
|
|
|
else
|
|
|
|
prev_key = key;
|
2018-04-18 15:56:03 -07:00
|
|
|
|
bpf: Fix a rcu warning for bpffs map pretty-print
Running selftest
./btf_btf -p
the kernel had the following warning:
[ 51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300
[ 51.529217] Modules linked in:
[ 51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878
[ 51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[ 51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300
...
[ 51.542826] Call Trace:
[ 51.543119] map_seq_next+0x53/0x80
[ 51.543528] seq_read+0x263/0x400
[ 51.543932] vfs_read+0xad/0x1c0
[ 51.544311] ksys_read+0x5f/0xe0
[ 51.544689] do_syscall_64+0x33/0x40
[ 51.545116] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The related source code in kernel/bpf/hashtab.c:
709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
710 {
711 struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
712 struct hlist_nulls_head *head;
713 struct htab_elem *l, *next_l;
714 u32 hash, key_size;
715 int i = 0;
716
717 WARN_ON_ONCE(!rcu_read_lock_held());
In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key()
without holding a rcu_read_lock(), hence causing the above warning.
To fix the issue, just surrounding map->ops->map_get_next_key() with rcu read lock.
Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com
2020-09-15 17:44:01 -07:00
|
|
|
rcu_read_lock();
|
2018-08-09 08:55:19 -07:00
|
|
|
if (map->ops->map_get_next_key(map, prev_key, key)) {
|
2018-04-18 15:56:03 -07:00
|
|
|
map_iter(m)->done = true;
|
bpf: Fix a rcu warning for bpffs map pretty-print
Running selftest
./btf_btf -p
the kernel had the following warning:
[ 51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300
[ 51.529217] Modules linked in:
[ 51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878
[ 51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[ 51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300
...
[ 51.542826] Call Trace:
[ 51.543119] map_seq_next+0x53/0x80
[ 51.543528] seq_read+0x263/0x400
[ 51.543932] vfs_read+0xad/0x1c0
[ 51.544311] ksys_read+0x5f/0xe0
[ 51.544689] do_syscall_64+0x33/0x40
[ 51.545116] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The related source code in kernel/bpf/hashtab.c:
709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
710 {
711 struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
712 struct hlist_nulls_head *head;
713 struct htab_elem *l, *next_l;
714 u32 hash, key_size;
715 int i = 0;
716
717 WARN_ON_ONCE(!rcu_read_lock_held());
In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key()
without holding a rcu_read_lock(), hence causing the above warning.
To fix the issue, just surrounding map->ops->map_get_next_key() with rcu read lock.
Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com
2020-09-15 17:44:01 -07:00
|
|
|
key = NULL;
|
2018-04-18 15:56:03 -07:00
|
|
|
}
|
bpf: Fix a rcu warning for bpffs map pretty-print
Running selftest
./btf_btf -p
the kernel had the following warning:
[ 51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300
[ 51.529217] Modules linked in:
[ 51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878
[ 51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[ 51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300
...
[ 51.542826] Call Trace:
[ 51.543119] map_seq_next+0x53/0x80
[ 51.543528] seq_read+0x263/0x400
[ 51.543932] vfs_read+0xad/0x1c0
[ 51.544311] ksys_read+0x5f/0xe0
[ 51.544689] do_syscall_64+0x33/0x40
[ 51.545116] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The related source code in kernel/bpf/hashtab.c:
709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
710 {
711 struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
712 struct hlist_nulls_head *head;
713 struct htab_elem *l, *next_l;
714 u32 hash, key_size;
715 int i = 0;
716
717 WARN_ON_ONCE(!rcu_read_lock_held());
In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key()
without holding a rcu_read_lock(), hence causing the above warning.
To fix the issue, just surrounding map->ops->map_get_next_key() with rcu read lock.
Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com
2020-09-15 17:44:01 -07:00
|
|
|
rcu_read_unlock();
|
2018-04-18 15:56:03 -07:00
|
|
|
return key;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *map_seq_start(struct seq_file *m, loff_t *pos)
|
|
|
|
{
|
|
|
|
if (map_iter(m)->done)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return *pos ? map_iter(m)->key : SEQ_START_TOKEN;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void map_seq_stop(struct seq_file *m, void *v)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static int map_seq_show(struct seq_file *m, void *v)
|
|
|
|
{
|
|
|
|
struct bpf_map *map = seq_file_to_map(m);
|
|
|
|
void *key = map_iter(m)->key;
|
|
|
|
|
|
|
|
if (unlikely(v == SEQ_START_TOKEN)) {
|
|
|
|
seq_puts(m, "# WARNING!! The output is for debug purpose only\n");
|
|
|
|
seq_puts(m, "# WARNING!! The output format will change\n");
|
|
|
|
} else {
|
|
|
|
map->ops->map_seq_show_elem(map, key, m);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct seq_operations bpffs_map_seq_ops = {
|
|
|
|
.start = map_seq_start,
|
|
|
|
.next = map_seq_next,
|
|
|
|
.show = map_seq_show,
|
|
|
|
.stop = map_seq_stop,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int bpffs_map_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
struct bpf_map *map = inode->i_private;
|
|
|
|
struct map_iter *iter;
|
|
|
|
struct seq_file *m;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
iter = map_iter_alloc(map);
|
|
|
|
if (!iter)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
err = seq_open(file, &bpffs_map_seq_ops);
|
|
|
|
if (err) {
|
|
|
|
map_iter_free(iter);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
m = file->private_data;
|
|
|
|
m->private = iter;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int bpffs_map_release(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
struct seq_file *m = file->private_data;
|
|
|
|
|
|
|
|
map_iter_free(map_iter(m));
|
|
|
|
|
|
|
|
return seq_release(inode, file);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* bpffs_map_fops should only implement the basic
|
|
|
|
* read operation for a BPF map. The purpose is to
|
|
|
|
* provide a simple user intuitive way to do
|
|
|
|
* "cat bpffs/pathto/a-pinned-map".
|
|
|
|
*
|
|
|
|
* Other operations (e.g. write, lookup...) should be realized by
|
|
|
|
* the userspace tools (e.g. bpftool) through the
|
|
|
|
* BPF_OBJ_GET_INFO_BY_FD and the map's lookup/update
|
|
|
|
* interface.
|
|
|
|
*/
|
|
|
|
static const struct file_operations bpffs_map_fops = {
|
|
|
|
.open = bpffs_map_open,
|
|
|
|
.read = seq_read,
|
|
|
|
.release = bpffs_map_release,
|
|
|
|
};
|
|
|
|
|
2018-06-08 18:10:34 +02:00
|
|
|
static int bpffs_obj_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct file_operations bpffs_obj_fops = {
|
|
|
|
.open = bpffs_obj_open,
|
|
|
|
};
|
|
|
|
|
2017-12-01 17:22:19 -05:00
|
|
|
static int bpf_mkobj_ops(struct dentry *dentry, umode_t mode, void *raw,
|
2018-04-18 15:56:03 -07:00
|
|
|
const struct inode_operations *iops,
|
|
|
|
const struct file_operations *fops)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
2017-12-01 17:22:19 -05:00
|
|
|
struct inode *dir = dentry->d_parent->d_inode;
|
|
|
|
struct inode *inode = bpf_get_inode(dir->i_sb, dir, mode);
|
2015-10-29 14:58:09 +01:00
|
|
|
if (IS_ERR(inode))
|
|
|
|
return PTR_ERR(inode);
|
|
|
|
|
|
|
|
inode->i_op = iops;
|
2018-04-18 15:56:03 -07:00
|
|
|
inode->i_fop = fops;
|
2017-12-01 17:22:19 -05:00
|
|
|
inode->i_private = raw;
|
2015-10-29 14:58:09 +01:00
|
|
|
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
bpf_dentry_finalize(dentry, inode, dir);
|
2015-10-29 14:58:09 +01:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-12-01 17:22:19 -05:00
|
|
|
static int bpf_mkprog(struct dentry *dentry, umode_t mode, void *arg)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
2018-06-08 18:10:34 +02:00
|
|
|
return bpf_mkobj_ops(dentry, mode, arg, &bpf_prog_iops,
|
|
|
|
&bpffs_obj_fops);
|
2017-12-01 17:22:19 -05:00
|
|
|
}
|
2015-10-29 14:58:09 +01:00
|
|
|
|
2017-12-01 17:22:19 -05:00
|
|
|
static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
|
|
|
|
{
|
2018-04-18 15:56:03 -07:00
|
|
|
struct bpf_map *map = arg;
|
|
|
|
|
|
|
|
return bpf_mkobj_ops(dentry, mode, arg, &bpf_map_iops,
|
2018-08-12 01:59:17 +02:00
|
|
|
bpf_map_support_seq_show(map) ?
|
|
|
|
&bpffs_map_fops : &bpffs_obj_fops);
|
2015-10-29 14:58:09 +01:00
|
|
|
}
|
|
|
|
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
|
|
|
|
{
|
2020-05-09 10:59:06 -07:00
|
|
|
struct bpf_link *link = arg;
|
|
|
|
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
|
2020-05-09 10:59:06 -07:00
|
|
|
bpf_link_is_iter(link) ?
|
|
|
|
&bpf_iter_fops : &bpffs_obj_fops);
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
}
|
|
|
|
|
2016-03-25 12:06:51 -04:00
|
|
|
static struct dentry *
|
|
|
|
bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
|
2015-12-10 22:33:49 +01:00
|
|
|
{
|
2018-03-08 23:46:33 -08:00
|
|
|
/* Dots in names (e.g. "/sys/fs/bpf/foo.bar") are reserved for future
|
2020-08-18 21:27:58 -07:00
|
|
|
* extensions. That allows popoulate_bpffs() create special files.
|
2018-03-08 23:46:33 -08:00
|
|
|
*/
|
2020-08-18 21:27:58 -07:00
|
|
|
if ((dir->i_mode & S_IALLUGO) &&
|
|
|
|
strchr(dentry->d_name.name, '.'))
|
2016-03-25 12:06:51 -04:00
|
|
|
return ERR_PTR(-EPERM);
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
|
2016-03-25 12:06:51 -04:00
|
|
|
return simple_lookup(dir, dentry, flags);
|
2015-12-10 22:33:49 +01:00
|
|
|
}
|
|
|
|
|
2023-01-13 12:49:14 +01:00
|
|
|
static int bpf_symlink(struct mnt_idmap *idmap, struct inode *dir,
|
2021-01-21 14:19:43 +01:00
|
|
|
struct dentry *dentry, const char *target)
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
{
|
|
|
|
char *link = kstrdup(target, GFP_USER | __GFP_NOWARN);
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
if (!link)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
inode = bpf_get_inode(dir->i_sb, dir, S_IRWXUGO | S_IFLNK);
|
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
kfree(link);
|
|
|
|
return PTR_ERR(inode);
|
|
|
|
}
|
|
|
|
|
|
|
|
inode->i_op = &simple_symlink_inode_operations;
|
|
|
|
inode->i_link = link;
|
|
|
|
|
|
|
|
bpf_dentry_finalize(dentry, inode, dir);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-10-29 14:58:09 +01:00
|
|
|
static const struct inode_operations bpf_dir_iops = {
|
2016-03-25 12:06:51 -04:00
|
|
|
.lookup = bpf_lookup,
|
2015-10-29 14:58:09 +01:00
|
|
|
.mkdir = bpf_mkdir,
|
bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 02:30:46 +02:00
|
|
|
.symlink = bpf_symlink,
|
2015-10-29 14:58:09 +01:00
|
|
|
.rmdir = simple_rmdir,
|
2016-03-25 12:06:51 -04:00
|
|
|
.rename = simple_rename,
|
|
|
|
.link = simple_link,
|
2015-10-29 14:58:09 +01:00
|
|
|
.unlink = simple_unlink,
|
|
|
|
};
|
|
|
|
|
2020-08-18 21:27:58 -07:00
|
|
|
/* pin iterator link into bpffs */
|
|
|
|
static int bpf_iter_link_pin_kernel(struct dentry *parent,
|
|
|
|
const char *name, struct bpf_link *link)
|
|
|
|
{
|
|
|
|
umode_t mode = S_IFREG | S_IRUSR;
|
|
|
|
struct dentry *dentry;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
inode_lock(parent->d_inode);
|
|
|
|
dentry = lookup_one_len(name, parent, strlen(name));
|
|
|
|
if (IS_ERR(dentry)) {
|
|
|
|
inode_unlock(parent->d_inode);
|
|
|
|
return PTR_ERR(dentry);
|
|
|
|
}
|
|
|
|
ret = bpf_mkobj_ops(dentry, mode, link, &bpf_link_iops,
|
|
|
|
&bpf_iter_fops);
|
|
|
|
dput(dentry);
|
|
|
|
inode_unlock(parent->d_inode);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw,
|
2015-10-29 14:58:09 +01:00
|
|
|
enum bpf_type type)
|
|
|
|
{
|
|
|
|
struct dentry *dentry;
|
|
|
|
struct inode *dir;
|
|
|
|
struct path path;
|
|
|
|
umode_t mode;
|
|
|
|
int ret;
|
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
dentry = user_path_create(path_fd, pathname, &path, 0);
|
2015-10-29 14:58:09 +01:00
|
|
|
if (IS_ERR(dentry))
|
|
|
|
return PTR_ERR(dentry);
|
|
|
|
|
|
|
|
dir = d_inode(path.dentry);
|
|
|
|
if (dir->i_op != &bpf_dir_iops) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2023-05-22 16:29:14 -07:00
|
|
|
mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
|
|
|
|
ret = security_path_mknod(&path, dentry, mode, 0);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2017-12-01 17:22:19 -05:00
|
|
|
switch (type) {
|
|
|
|
case BPF_TYPE_PROG:
|
|
|
|
ret = vfs_mkobj(dentry, mode, bpf_mkprog, raw);
|
|
|
|
break;
|
|
|
|
case BPF_TYPE_MAP:
|
|
|
|
ret = vfs_mkobj(dentry, mode, bpf_mkmap, raw);
|
|
|
|
break;
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
case BPF_TYPE_LINK:
|
|
|
|
ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
|
|
|
|
break;
|
2017-12-01 17:22:19 -05:00
|
|
|
default:
|
|
|
|
ret = -EPERM;
|
|
|
|
}
|
2015-10-29 14:58:09 +01:00
|
|
|
out:
|
|
|
|
done_path_create(&path, dentry);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
|
|
|
enum bpf_type type;
|
|
|
|
void *raw;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
raw = bpf_fd_probe_obj(ufd, &type);
|
2020-01-20 23:28:58 +00:00
|
|
|
if (IS_ERR(raw))
|
|
|
|
return PTR_ERR(raw);
|
2015-10-29 14:58:09 +01:00
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
ret = bpf_obj_do_pin(path_fd, pathname, raw, type);
|
2015-10-29 14:58:09 +01:00
|
|
|
if (ret != 0)
|
|
|
|
bpf_any_put(raw, type);
|
2020-01-20 23:28:58 +00:00
|
|
|
|
2015-10-29 14:58:09 +01:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
static void *bpf_obj_do_get(int path_fd, const char __user *pathname,
|
2017-10-18 13:00:22 -07:00
|
|
|
enum bpf_type *type, int flags)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
|
|
|
struct inode *inode;
|
|
|
|
struct path path;
|
|
|
|
void *raw;
|
|
|
|
int ret;
|
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
ret = user_path_at(path_fd, pathname, LOOKUP_FOLLOW, &path);
|
2015-10-29 14:58:09 +01:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
|
|
|
inode = d_backing_inode(path.dentry);
|
2021-01-21 14:19:22 +01:00
|
|
|
ret = path_permission(&path, ACC_MODE(flags));
|
2015-10-29 14:58:09 +01:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = bpf_inode_type(inode, type);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
raw = bpf_any_get(inode->i_private, *type);
|
2016-04-27 18:56:20 -07:00
|
|
|
if (!IS_ERR(raw))
|
|
|
|
touch_atime(&path);
|
2015-10-29 14:58:09 +01:00
|
|
|
|
|
|
|
path_put(&path);
|
|
|
|
return raw;
|
|
|
|
out:
|
|
|
|
path_put(&path);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
|
|
|
enum bpf_type type = BPF_TYPE_UNSPEC;
|
2017-10-18 13:00:22 -07:00
|
|
|
int f_flags;
|
2015-10-29 14:58:09 +01:00
|
|
|
void *raw;
|
2020-01-20 23:28:58 +00:00
|
|
|
int ret;
|
2015-10-29 14:58:09 +01:00
|
|
|
|
2017-10-18 13:00:22 -07:00
|
|
|
f_flags = bpf_get_file_flag(flags);
|
|
|
|
if (f_flags < 0)
|
|
|
|
return f_flags;
|
|
|
|
|
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.
One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.
This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.
This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-15 16:48:06 -07:00
|
|
|
raw = bpf_obj_do_get(path_fd, pathname, &type, f_flags);
|
2020-01-20 23:28:58 +00:00
|
|
|
if (IS_ERR(raw))
|
|
|
|
return PTR_ERR(raw);
|
2015-10-29 14:58:09 +01:00
|
|
|
|
|
|
|
if (type == BPF_TYPE_PROG)
|
2021-06-18 03:55:26 -07:00
|
|
|
ret = bpf_prog_new_fd(raw);
|
2015-10-29 14:58:09 +01:00
|
|
|
else if (type == BPF_TYPE_MAP)
|
2017-10-18 13:00:22 -07:00
|
|
|
ret = bpf_map_new_fd(raw, f_flags);
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
else if (type == BPF_TYPE_LINK)
|
2021-03-26 16:05:00 +00:00
|
|
|
ret = (f_flags != O_RDWR) ? -EINVAL : bpf_link_new_fd(raw);
|
2015-10-29 14:58:09 +01:00
|
|
|
else
|
2020-01-20 23:28:58 +00:00
|
|
|
return -ENOENT;
|
2015-10-29 14:58:09 +01:00
|
|
|
|
2018-04-28 19:56:37 -07:00
|
|
|
if (ret < 0)
|
2015-10-29 14:58:09 +01:00
|
|
|
bpf_any_put(raw, type);
|
|
|
|
return ret;
|
|
|
|
}
|
2017-12-02 20:20:38 -05:00
|
|
|
|
|
|
|
static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type type)
|
|
|
|
{
|
|
|
|
struct bpf_prog *prog;
|
2023-01-13 12:49:22 +01:00
|
|
|
int ret = inode_permission(&nop_mnt_idmap, inode, MAY_READ);
|
2017-12-02 20:20:38 -05:00
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
|
|
|
if (inode->i_op == &bpf_map_iops)
|
|
|
|
return ERR_PTR(-EINVAL);
|
bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.
Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.
Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 20:31:57 -08:00
|
|
|
if (inode->i_op == &bpf_link_iops)
|
|
|
|
return ERR_PTR(-EINVAL);
|
2017-12-02 20:20:38 -05:00
|
|
|
if (inode->i_op != &bpf_prog_iops)
|
|
|
|
return ERR_PTR(-EACCES);
|
|
|
|
|
|
|
|
prog = inode->i_private;
|
|
|
|
|
|
|
|
ret = security_bpf_prog(prog);
|
|
|
|
if (ret < 0)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
|
|
|
|
if (!bpf_prog_get_ok(prog, &type, false))
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
2019-11-17 09:28:03 -08:00
|
|
|
bpf_prog_inc(prog);
|
|
|
|
return prog;
|
2017-12-02 20:20:38 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type)
|
|
|
|
{
|
|
|
|
struct bpf_prog *prog;
|
|
|
|
struct path path;
|
|
|
|
int ret = kern_path(name, LOOKUP_FOLLOW, &path);
|
|
|
|
if (ret)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
prog = __get_prog_inode(d_backing_inode(path.dentry), type);
|
|
|
|
if (!IS_ERR(prog))
|
|
|
|
touch_atime(&path);
|
|
|
|
path_put(&path);
|
|
|
|
return prog;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(bpf_prog_get_type_path);
|
2015-10-29 14:58:09 +01:00
|
|
|
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
struct bpffs_btf_enums {
|
|
|
|
const struct btf *btf;
|
|
|
|
const struct btf_type *cmd_t;
|
|
|
|
const struct btf_type *map_t;
|
|
|
|
const struct btf_type *prog_t;
|
|
|
|
const struct btf_type *attach_t;
|
|
|
|
};
|
|
|
|
|
|
|
|
static int find_bpffs_btf_enums(struct bpffs_btf_enums *info)
|
|
|
|
{
|
|
|
|
const struct btf *btf;
|
|
|
|
const struct btf_type *t;
|
|
|
|
const char *name;
|
|
|
|
int i, n;
|
|
|
|
|
|
|
|
memset(info, 0, sizeof(*info));
|
|
|
|
|
|
|
|
btf = bpf_get_btf_vmlinux();
|
|
|
|
if (IS_ERR(btf))
|
|
|
|
return PTR_ERR(btf);
|
|
|
|
if (!btf)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
info->btf = btf;
|
|
|
|
|
|
|
|
for (i = 1, n = btf_nr_types(btf); i < n; i++) {
|
|
|
|
t = btf_type_by_id(btf, i);
|
|
|
|
if (!btf_type_is_enum(t))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
name = btf_name_by_offset(btf, t->name_off);
|
|
|
|
if (!name)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (strcmp(name, "bpf_cmd") == 0)
|
|
|
|
info->cmd_t = t;
|
|
|
|
else if (strcmp(name, "bpf_map_type") == 0)
|
|
|
|
info->map_t = t;
|
|
|
|
else if (strcmp(name, "bpf_prog_type") == 0)
|
|
|
|
info->prog_t = t;
|
|
|
|
else if (strcmp(name, "bpf_attach_type") == 0)
|
|
|
|
info->attach_t = t;
|
|
|
|
else
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (info->cmd_t && info->map_t && info->prog_t && info->attach_t)
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return -ESRCH;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool find_btf_enum_const(const struct btf *btf, const struct btf_type *enum_t,
|
|
|
|
const char *prefix, const char *str, int *value)
|
|
|
|
{
|
|
|
|
const struct btf_enum *e;
|
|
|
|
const char *name;
|
|
|
|
int i, n, pfx_len = strlen(prefix);
|
|
|
|
|
|
|
|
*value = 0;
|
|
|
|
|
|
|
|
if (!btf || !enum_t)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
for (i = 0, n = btf_vlen(enum_t); i < n; i++) {
|
|
|
|
e = &btf_enum(enum_t)[i];
|
|
|
|
|
|
|
|
name = btf_name_by_offset(btf, e->name_off);
|
|
|
|
if (!name || strncasecmp(name, prefix, pfx_len) != 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* match symbolic name case insensitive and ignoring prefix */
|
|
|
|
if (strcasecmp(name + pfx_len, str) == 0) {
|
|
|
|
*value = e->val;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void seq_print_delegate_opts(struct seq_file *m,
|
|
|
|
const char *opt_name,
|
|
|
|
const struct btf *btf,
|
|
|
|
const struct btf_type *enum_t,
|
|
|
|
const char *prefix,
|
|
|
|
u64 delegate_msk, u64 any_msk)
|
|
|
|
{
|
|
|
|
const struct btf_enum *e;
|
|
|
|
bool first = true;
|
|
|
|
const char *name;
|
|
|
|
u64 msk;
|
|
|
|
int i, n, pfx_len = strlen(prefix);
|
|
|
|
|
|
|
|
delegate_msk &= any_msk; /* clear unknown bits */
|
|
|
|
|
|
|
|
if (delegate_msk == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
seq_printf(m, ",%s", opt_name);
|
|
|
|
if (delegate_msk == any_msk) {
|
|
|
|
seq_printf(m, "=any");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (btf && enum_t) {
|
|
|
|
for (i = 0, n = btf_vlen(enum_t); i < n; i++) {
|
|
|
|
e = &btf_enum(enum_t)[i];
|
|
|
|
name = btf_name_by_offset(btf, e->name_off);
|
|
|
|
if (!name || strncasecmp(name, prefix, pfx_len) != 0)
|
|
|
|
continue;
|
|
|
|
msk = 1ULL << e->val;
|
|
|
|
if (delegate_msk & msk) {
|
|
|
|
/* emit lower-case name without prefix */
|
2024-07-15 11:12:30 +02:00
|
|
|
seq_putc(m, first ? '=' : ':');
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
name += pfx_len;
|
|
|
|
while (*name) {
|
2024-07-15 11:12:30 +02:00
|
|
|
seq_putc(m, tolower(*name));
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
name++;
|
|
|
|
}
|
|
|
|
|
|
|
|
delegate_msk &= ~msk;
|
|
|
|
first = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (delegate_msk)
|
|
|
|
seq_printf(m, "%c0x%llx", first ? '=' : ':', delegate_msk);
|
|
|
|
}
|
|
|
|
|
2017-07-05 16:24:49 +01:00
|
|
|
/*
|
|
|
|
* Display the mount options in /proc/mounts.
|
|
|
|
*/
|
|
|
|
static int bpf_show_options(struct seq_file *m, struct dentry *root)
|
|
|
|
{
|
2023-12-20 14:38:05 +01:00
|
|
|
struct inode *inode = d_inode(root);
|
|
|
|
umode_t mode = inode->i_mode & S_IALLUGO & ~S_ISVTX;
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
|
bpf: Introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.
This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).
BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.
When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.
Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).
Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
2024-01-23 18:21:00 -08:00
|
|
|
u64 mask;
|
2023-12-20 14:38:05 +01:00
|
|
|
|
|
|
|
if (!uid_eq(inode->i_uid, GLOBAL_ROOT_UID))
|
|
|
|
seq_printf(m, ",uid=%u",
|
|
|
|
from_kuid_munged(&init_user_ns, inode->i_uid));
|
|
|
|
if (!gid_eq(inode->i_gid, GLOBAL_ROOT_GID))
|
|
|
|
seq_printf(m, ",gid=%u",
|
|
|
|
from_kgid_munged(&init_user_ns, inode->i_gid));
|
2017-07-05 16:24:49 +01:00
|
|
|
if (mode != S_IRWXUGO)
|
|
|
|
seq_printf(m, ",mode=%o", mode);
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
if (opts->delegate_cmds || opts->delegate_maps ||
|
|
|
|
opts->delegate_progs || opts->delegate_attachs) {
|
|
|
|
struct bpffs_btf_enums info;
|
|
|
|
|
|
|
|
/* ignore errors, fallback to hex */
|
|
|
|
(void)find_bpffs_btf_enums(&info);
|
|
|
|
|
|
|
|
mask = (1ULL << __MAX_BPF_CMD) - 1;
|
|
|
|
seq_print_delegate_opts(m, "delegate_cmds",
|
|
|
|
info.btf, info.cmd_t, "BPF_",
|
|
|
|
opts->delegate_cmds, mask);
|
|
|
|
|
|
|
|
mask = (1ULL << __MAX_BPF_MAP_TYPE) - 1;
|
|
|
|
seq_print_delegate_opts(m, "delegate_maps",
|
|
|
|
info.btf, info.map_t, "BPF_MAP_TYPE_",
|
|
|
|
opts->delegate_maps, mask);
|
|
|
|
|
|
|
|
mask = (1ULL << __MAX_BPF_PROG_TYPE) - 1;
|
|
|
|
seq_print_delegate_opts(m, "delegate_progs",
|
|
|
|
info.btf, info.prog_t, "BPF_PROG_TYPE_",
|
|
|
|
opts->delegate_progs, mask);
|
|
|
|
|
|
|
|
mask = (1ULL << __MAX_BPF_ATTACH_TYPE) - 1;
|
|
|
|
seq_print_delegate_opts(m, "delegate_attachs",
|
|
|
|
info.btf, info.attach_t, "BPF_",
|
|
|
|
opts->delegate_attachs, mask);
|
|
|
|
}
|
|
|
|
|
2017-07-05 16:24:49 +01:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-04-15 22:31:29 -04:00
|
|
|
static void bpf_free_inode(struct inode *inode)
|
2019-03-25 15:54:43 +01:00
|
|
|
{
|
|
|
|
enum bpf_type type;
|
|
|
|
|
|
|
|
if (S_ISLNK(inode->i_mode))
|
|
|
|
kfree(inode->i_link);
|
|
|
|
if (!bpf_inode_type(inode, &type))
|
|
|
|
bpf_any_put(inode->i_private, type);
|
|
|
|
free_inode_nonrcu(inode);
|
|
|
|
}
|
|
|
|
|
bpf: Introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.
This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).
BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.
When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.
Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).
Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token. Also creating BPF token in init user namespace is
currently not supported, given BPF token doesn't have any effect in init
user namespace anyways.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
2024-01-23 18:21:00 -08:00
|
|
|
const struct super_operations bpf_super_ops = {
|
2015-10-29 14:58:09 +01:00
|
|
|
.statfs = simple_statfs,
|
|
|
|
.drop_inode = generic_delete_inode,
|
2017-07-05 16:24:49 +01:00
|
|
|
.show_options = bpf_show_options,
|
2019-04-15 22:31:29 -04:00
|
|
|
.free_inode = bpf_free_inode,
|
2015-10-29 14:58:09 +01:00
|
|
|
};
|
|
|
|
|
2016-11-26 01:28:08 +01:00
|
|
|
enum {
|
2023-12-20 14:38:05 +01:00
|
|
|
OPT_UID,
|
|
|
|
OPT_GID,
|
2016-11-26 01:28:08 +01:00
|
|
|
OPT_MODE,
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
OPT_DELEGATE_CMDS,
|
|
|
|
OPT_DELEGATE_MAPS,
|
|
|
|
OPT_DELEGATE_PROGS,
|
|
|
|
OPT_DELEGATE_ATTACHS,
|
2016-11-26 01:28:08 +01:00
|
|
|
};
|
|
|
|
|
2019-09-07 07:23:15 -04:00
|
|
|
static const struct fs_parameter_spec bpf_fs_parameters[] = {
|
2023-12-20 14:38:05 +01:00
|
|
|
fsparam_u32 ("uid", OPT_UID),
|
|
|
|
fsparam_u32 ("gid", OPT_GID),
|
2019-03-22 14:58:36 +00:00
|
|
|
fsparam_u32oct ("mode", OPT_MODE),
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
fsparam_string ("delegate_cmds", OPT_DELEGATE_CMDS),
|
|
|
|
fsparam_string ("delegate_maps", OPT_DELEGATE_MAPS),
|
|
|
|
fsparam_string ("delegate_progs", OPT_DELEGATE_PROGS),
|
|
|
|
fsparam_string ("delegate_attachs", OPT_DELEGATE_ATTACHS),
|
2019-03-22 14:58:36 +00:00
|
|
|
{}
|
|
|
|
};
|
|
|
|
|
|
|
|
static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
|
2016-11-26 01:28:08 +01:00
|
|
|
{
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
struct bpf_mount_opts *opts = fc->s_fs_info;
|
2019-03-22 14:58:36 +00:00
|
|
|
struct fs_parse_result result;
|
2023-12-20 14:38:05 +01:00
|
|
|
kuid_t uid;
|
|
|
|
kgid_t gid;
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
int opt, err;
|
2016-11-26 01:28:08 +01:00
|
|
|
|
2019-09-07 07:23:15 -04:00
|
|
|
opt = fs_parse(fc, bpf_fs_parameters, param, &result);
|
2022-01-08 13:46:23 +00:00
|
|
|
if (opt < 0) {
|
2016-11-26 01:28:08 +01:00
|
|
|
/* We might like to report bad mount options here, but
|
|
|
|
* traditionally we've ignored all mount options, so we'd
|
|
|
|
* better continue to ignore non-existing options for bpf.
|
|
|
|
*/
|
2022-01-08 13:46:23 +00:00
|
|
|
if (opt == -ENOPARAM) {
|
|
|
|
opt = vfs_parse_fs_param_source(fc, param);
|
|
|
|
if (opt != -ENOPARAM)
|
|
|
|
return opt;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (opt < 0)
|
|
|
|
return opt;
|
|
|
|
}
|
2019-03-22 14:58:36 +00:00
|
|
|
|
|
|
|
switch (opt) {
|
2023-12-20 14:38:05 +01:00
|
|
|
case OPT_UID:
|
|
|
|
uid = make_kuid(current_user_ns(), result.uint_32);
|
|
|
|
if (!uid_valid(uid))
|
|
|
|
goto bad_value;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The requested uid must be representable in the
|
|
|
|
* filesystem's idmapping.
|
|
|
|
*/
|
|
|
|
if (!kuid_has_mapping(fc->user_ns, uid))
|
|
|
|
goto bad_value;
|
|
|
|
|
|
|
|
opts->uid = uid;
|
|
|
|
break;
|
|
|
|
case OPT_GID:
|
|
|
|
gid = make_kgid(current_user_ns(), result.uint_32);
|
|
|
|
if (!gid_valid(gid))
|
|
|
|
goto bad_value;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The requested gid must be representable in the
|
|
|
|
* filesystem's idmapping.
|
|
|
|
*/
|
|
|
|
if (!kgid_has_mapping(fc->user_ns, gid))
|
|
|
|
goto bad_value;
|
|
|
|
|
|
|
|
opts->gid = gid;
|
|
|
|
break;
|
2019-03-22 14:58:36 +00:00
|
|
|
case OPT_MODE:
|
|
|
|
opts->mode = result.uint_32 & S_IALLUGO;
|
|
|
|
break;
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
case OPT_DELEGATE_CMDS:
|
|
|
|
case OPT_DELEGATE_MAPS:
|
|
|
|
case OPT_DELEGATE_PROGS:
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
case OPT_DELEGATE_ATTACHS: {
|
|
|
|
struct bpffs_btf_enums info;
|
|
|
|
const struct btf_type *enum_t;
|
|
|
|
const char *enum_pfx;
|
|
|
|
u64 *delegate_msk, msk = 0;
|
2024-10-22 21:01:33 +08:00
|
|
|
char *p, *str;
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
int val;
|
|
|
|
|
|
|
|
/* ignore errors, fallback to hex */
|
|
|
|
(void)find_bpffs_btf_enums(&info);
|
|
|
|
|
|
|
|
switch (opt) {
|
|
|
|
case OPT_DELEGATE_CMDS:
|
|
|
|
delegate_msk = &opts->delegate_cmds;
|
|
|
|
enum_t = info.cmd_t;
|
|
|
|
enum_pfx = "BPF_";
|
|
|
|
break;
|
|
|
|
case OPT_DELEGATE_MAPS:
|
|
|
|
delegate_msk = &opts->delegate_maps;
|
|
|
|
enum_t = info.map_t;
|
|
|
|
enum_pfx = "BPF_MAP_TYPE_";
|
|
|
|
break;
|
|
|
|
case OPT_DELEGATE_PROGS:
|
|
|
|
delegate_msk = &opts->delegate_progs;
|
|
|
|
enum_t = info.prog_t;
|
|
|
|
enum_pfx = "BPF_PROG_TYPE_";
|
|
|
|
break;
|
|
|
|
case OPT_DELEGATE_ATTACHS:
|
|
|
|
delegate_msk = &opts->delegate_attachs;
|
|
|
|
enum_t = info.attach_t;
|
|
|
|
enum_pfx = "BPF_";
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
}
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
|
2024-10-22 21:01:33 +08:00
|
|
|
str = param->string;
|
|
|
|
while ((p = strsep(&str, ":"))) {
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
if (strcmp(p, "any") == 0) {
|
|
|
|
msk |= ~0ULL;
|
|
|
|
} else if (find_btf_enum_const(info.btf, enum_t, enum_pfx, p, &val)) {
|
|
|
|
msk |= 1ULL << val;
|
|
|
|
} else {
|
|
|
|
err = kstrtou64(p, 0, &msk);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
/* Setting delegation mount options requires privileges */
|
|
|
|
if (msk && !capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
bpf: Support symbolic BPF FS delegation mount options
Besides already supported special "any" value and hex bit mask, support
string-based parsing of delegation masks based on exact enumerator
names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
`enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
normalized to lower case in mount option output. So "PROG_LOAD",
"prog_load", and "MAP_create" are all valid values to specify for
delegate_cmds options, "array" is among supported for map types, etc.
Besides supporting string values, we also support multiple values
specified at the same time, using colon (':') separator.
There are corresponding changes on bpf_show_options side to use known
values to print them in human-readable format, falling back to hex mask
printing, if there are any unrecognized bits. This shouldn't be
necessary when enum BTF information is present, but in general we should
always be able to fall back to this even if kernel was built without BTF.
As mentioned, emitted symbolic names are normalized to be all lower case.
Example below shows various ways to specify delegate_cmds options
through mount command and how mount options are printed back:
12/14 14:39:07.604
vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
$ mount | rg token
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
$ mount | grep token
bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-23 18:21:16 -08:00
|
|
|
|
|
|
|
*delegate_msk |= msk;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
default:
|
|
|
|
/* ignore unknown mount options */
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
break;
|
2016-11-26 01:28:08 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2023-12-20 14:38:05 +01:00
|
|
|
bad_value:
|
|
|
|
return invalfc(fc, "Bad value for '%s'", param->key);
|
2016-11-26 01:28:08 +01:00
|
|
|
}
|
|
|
|
|
2020-08-18 21:27:58 -07:00
|
|
|
struct bpf_preload_ops *bpf_preload_ops;
|
|
|
|
EXPORT_SYMBOL_GPL(bpf_preload_ops);
|
|
|
|
|
|
|
|
static bool bpf_preload_mod_get(void)
|
|
|
|
{
|
|
|
|
/* If bpf_preload.ko wasn't loaded earlier then load it now.
|
|
|
|
* When bpf_preload is built into vmlinux the module's __init
|
|
|
|
* function will populate it.
|
|
|
|
*/
|
|
|
|
if (!bpf_preload_ops) {
|
|
|
|
request_module("bpf_preload");
|
|
|
|
if (!bpf_preload_ops)
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
/* And grab the reference, so the module doesn't disappear while the
|
|
|
|
* kernel is interacting with the kernel module and its UMD.
|
|
|
|
*/
|
|
|
|
if (!try_module_get(bpf_preload_ops->owner)) {
|
|
|
|
pr_err("bpf_preload module get failed.\n");
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_preload_mod_put(void)
|
|
|
|
{
|
|
|
|
if (bpf_preload_ops)
|
|
|
|
/* now user can "rmmod bpf_preload" if necessary */
|
|
|
|
module_put(bpf_preload_ops->owner);
|
|
|
|
}
|
|
|
|
|
|
|
|
static DEFINE_MUTEX(bpf_preload_lock);
|
|
|
|
|
|
|
|
static int populate_bpffs(struct dentry *parent)
|
|
|
|
{
|
|
|
|
struct bpf_preload_info objs[BPF_PRELOAD_LINKS] = {};
|
|
|
|
int err = 0, i;
|
|
|
|
|
|
|
|
/* grab the mutex to make sure the kernel interactions with bpf_preload
|
2022-02-09 15:20:01 -08:00
|
|
|
* are serialized
|
2020-08-18 21:27:58 -07:00
|
|
|
*/
|
|
|
|
mutex_lock(&bpf_preload_lock);
|
|
|
|
|
|
|
|
/* if bpf_preload.ko wasn't built into vmlinux then load it */
|
|
|
|
if (!bpf_preload_mod_get())
|
|
|
|
goto out;
|
|
|
|
|
2022-02-09 15:20:01 -08:00
|
|
|
err = bpf_preload_ops->preload(objs);
|
|
|
|
if (err)
|
|
|
|
goto out_put;
|
|
|
|
for (i = 0; i < BPF_PRELOAD_LINKS; i++) {
|
|
|
|
bpf_link_inc(objs[i].link);
|
|
|
|
err = bpf_iter_link_pin_kernel(parent,
|
|
|
|
objs[i].link_name, objs[i].link);
|
|
|
|
if (err) {
|
|
|
|
bpf_link_put(objs[i].link);
|
2020-08-18 21:27:58 -07:00
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
out_put:
|
|
|
|
bpf_preload_mod_put();
|
|
|
|
out:
|
|
|
|
mutex_unlock(&bpf_preload_lock);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2019-03-22 14:58:36 +00:00
|
|
|
static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
2017-03-25 21:15:37 -07:00
|
|
|
static const struct tree_descr bpf_rfiles[] = { { "" } };
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
struct bpf_mount_opts *opts = sb->s_fs_info;
|
2015-10-29 14:58:09 +01:00
|
|
|
struct inode *inode;
|
|
|
|
int ret;
|
|
|
|
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
/* Mounting an instance of BPF FS requires privileges */
|
|
|
|
if (fc->user_ns != &init_user_ns && !capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2015-10-29 14:58:09 +01:00
|
|
|
ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
sb->s_op = &bpf_super_ops;
|
|
|
|
|
|
|
|
inode = sb->s_root->d_inode;
|
2023-12-20 14:38:05 +01:00
|
|
|
inode->i_uid = opts->uid;
|
|
|
|
inode->i_gid = opts->gid;
|
2015-10-29 14:58:09 +01:00
|
|
|
inode->i_op = &bpf_dir_iops;
|
|
|
|
inode->i_mode &= ~S_IALLUGO;
|
2020-08-18 21:27:58 -07:00
|
|
|
populate_bpffs(sb->s_root);
|
2019-03-22 14:58:36 +00:00
|
|
|
inode->i_mode |= S_ISVTX | opts->mode;
|
2015-10-29 14:58:09 +01:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-03-22 14:58:36 +00:00
|
|
|
static int bpf_get_tree(struct fs_context *fc)
|
|
|
|
{
|
|
|
|
return get_tree_nodev(fc, bpf_fill_super);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void bpf_free_fc(struct fs_context *fc)
|
2015-10-29 14:58:09 +01:00
|
|
|
{
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
kfree(fc->s_fs_info);
|
2019-03-22 14:58:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static const struct fs_context_operations bpf_context_ops = {
|
|
|
|
.free = bpf_free_fc,
|
|
|
|
.parse_param = bpf_parse_param,
|
|
|
|
.get_tree = bpf_get_tree,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set up the filesystem mount context.
|
|
|
|
*/
|
|
|
|
static int bpf_init_fs_context(struct fs_context *fc)
|
|
|
|
{
|
|
|
|
struct bpf_mount_opts *opts;
|
|
|
|
|
|
|
|
opts = kzalloc(sizeof(struct bpf_mount_opts), GFP_KERNEL);
|
|
|
|
if (!opts)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
opts->mode = S_IRWXUGO;
|
2023-12-20 14:38:05 +01:00
|
|
|
opts->uid = current_fsuid();
|
|
|
|
opts->gid = current_fsgid();
|
bpf: add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-30 10:52:14 -08:00
|
|
|
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
/* start out with no BPF token delegation enabled */
|
|
|
|
opts->delegate_cmds = 0;
|
|
|
|
opts->delegate_maps = 0;
|
|
|
|
opts->delegate_progs = 0;
|
|
|
|
opts->delegate_attachs = 0;
|
|
|
|
|
|
|
|
fc->s_fs_info = opts;
|
2019-03-22 14:58:36 +00:00
|
|
|
fc->ops = &bpf_context_ops;
|
|
|
|
return 0;
|
2015-10-29 14:58:09 +01:00
|
|
|
}
|
|
|
|
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
static void bpf_kill_super(struct super_block *sb)
|
|
|
|
{
|
|
|
|
struct bpf_mount_opts *opts = sb->s_fs_info;
|
|
|
|
|
|
|
|
kill_litter_super(sb);
|
|
|
|
kfree(opts);
|
|
|
|
}
|
|
|
|
|
2015-10-29 14:58:09 +01:00
|
|
|
static struct file_system_type bpf_fs_type = {
|
|
|
|
.owner = THIS_MODULE,
|
|
|
|
.name = "bpf",
|
2019-03-22 14:58:36 +00:00
|
|
|
.init_fs_context = bpf_init_fs_context,
|
2019-09-07 07:23:15 -04:00
|
|
|
.parameters = bpf_fs_parameters,
|
bpf: Add BPF token delegation mount options to BPF FS
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
- `delegate_cmds` allow to specify which bpf() syscall commands are
allowed with BPF token derived from this BPF FS instance;
- if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
a set of allowable BPF map types that could be created with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
a set of allowable BPF program types that could be loaded with BPF token;
- if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
a set of allowable BPF program attach types that could be loaded with
BPF token; delegate_progs and delegate_attachs are meant to be used
together, as full BPF program type is, in general, determined
through both program type and program attach type.
Currently, these mount options accept the following forms of values:
- a special value "any", that enables all possible values of a given
bit set;
- numeric value (decimal or hexadecimal, determined by kernel
automatically) that specifies a bit mask value directly;
- all the values for a given mount option are combined, if specified
multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
mask.
Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.
Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.
This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.
There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.
This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.
[0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-23 18:20:59 -08:00
|
|
|
.kill_sb = bpf_kill_super,
|
|
|
|
.fs_flags = FS_USERNS_MOUNT,
|
2015-10-29 14:58:09 +01:00
|
|
|
};
|
|
|
|
|
|
|
|
static int __init bpf_init(void)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = sysfs_create_mount_point(fs_kobj, "bpf");
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = register_filesystem(&bpf_fs_type);
|
|
|
|
if (ret)
|
|
|
|
sysfs_remove_mount_point(fs_kobj, "bpf");
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
fs_initcall(bpf_init);
|