These tests expose the issue of being unable to properly check for errors
returned from inlined bpf map helpers that make calls to the bpf_map_ops
functions. At best, a check for zero or non-zero can be done but these
tests show it is not possible to check for a negative value or for a
specific error value.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230322194754.185781-2-inwardvessel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Teach the verifier to recognize PTR_TO_MEM | MEM_RDONLY as not NULL
otherwise if (!bpf_ksym_exists(known_kfunc)) doesn't go through
dead code elimination.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230321203854.3035-3-alexei.starovoitov@gmail.com
RELO_EXTERN_VAR/FUNC names are not correct anymore. RELO_EXTERN_VAR represent
ksym symbol in ld_imm64 insn. It can point to kernel variable or kfunc.
Rename RELO_EXTERN_VAR->RELO_EXTERN_LD64 and RELO_EXTERN_FUNC->RELO_EXTERN_CALL
to match what they actually represent.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230321203854.3035-2-alexei.starovoitov@gmail.com
Add a new test in copy-mode for testing the copying of metadata from the
buffer in kernel-space to user-space. This is accomplished by adding a
new XDP program and using the bss map to store a counter that is written
to the metadata field. This counter is incremented for every packet so
that the number becomes unique and should be the same as the payload. It
is store in the bss so the value can be reset between runs.
The XDP program populates the metadata and the userspace program checks
the value stored in the metadata field against the payload using the new
is_metadata_correct() function. To turn this verification on or off, add
a new parameter (use_metadata) to the ifobject structure.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230320102705.306187-1-tushar.vyavahare@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski says:
====================
I'm trying to make more of the sk_buff bits optional.
Move the BPF-accessed bits a little - because they must
be at coding-time-constant offsets they must precede any
optional bit. While at it clean up the naming a bit.
v1: https://lore.kernel.org/all/20230308003159.441580-1-kuba@kernel.org/
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
To avoid more possible BPF dependencies with moving bitfields
around keep the fields BPF cares about right next to the offset
marker.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230321014115.997841-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
BPF needs to know the offsets of fields it tries to access.
Zero-length fields are added to make offsetof() work.
This unfortunately partitions the bitfield (fields across
the zero-length members can't be coalesced).
Reorder bytes 2 and 3, BPF needs to know the offset of fields
previously in byte 3 and some fields in byte 2 should really
be optional.
The two bytes are always in the same cacheline so it should
not matter.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230321014115.997841-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
vlan_present is gone since
commit 354259fa73e2 ("net: remove skb->vlan_present")
rename the offset field to what BPF is currently looking
for in this byte - mono_delivery_time and tc_at_ingress.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230321014115.997841-2-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Write data to fd by calling "vdprintf", in most implementations
of the standard library, the data is finally written by the writev syscall.
But "uprobe_events/kprobe_events" does not allow segmented writes,
so switch the "append_to_file" function to explicit write() call.
Signed-off-by: Liu Pan <patteliu@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230320030720.650-1-patteliu@gmail.com
Unlike normal libbpf the light skeleton 'loader' program is doing
btf_find_by_name_kind() call at run-time to find ksym in the kernel and
populate its {btf_id, btf_obj_fd} pair in ld_imm64 insn. To avoid doing the
search multiple times for the same ksym it remembers the first patched ld_imm64
insn and copies {btf_id, btf_obj_fd} from it into subsequent ld_imm64 insn.
Fix a bug in copying logic, since it may incorrectly clear BPF_PSEUDO_BTF_ID flag.
Also replace always true if (btf_obj_fd >= 0) check with unconditional JMP_JA
to clarify the code.
Fixes: d995816b77eb ("libbpf: Avoid reload of imm for weak, unresolved, repeating ksym")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230319203014.55866-1-alexei.starovoitov@gmail.com
This patch documents overview of libbpf, including its features for
developing BPF programs.
Signed-off-by: Sreevani Sreejith <ssreevani@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230315195405.2051559-1-ssreevani@meta.com
Currently, test_progs outputs all stdout/stderr as it runs, and when it
is done, prints a summary.
It is non-trivial for tooling to parse that output and extract meaningful
information from it.
This change adds a new option, `--json-summary`/`-J` that let the caller
specify a file where `test_progs{,-no_alu32}` can write a summary of the
run in a json format that can later be parsed by tooling.
Currently, it creates a summary section with successes/skipped/failures
followed by a list of failed tests and subtests.
A test contains the following fields:
- name: the name of the test
- number: the number of the test
- message: the log message that was printed by the test.
- failed: A boolean indicating whether the test failed or not. Currently
we only output failed tests, but in the future, successful tests could
be added.
- subtests: A list of subtests associated with this test.
A subtest contains the following fields:
- name: same as above
- number: sanme as above
- message: the log message that was printed by the subtest.
- failed: same as above but for the subtest
An example run and json content below:
```
$ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print
$1","}' | tr -d '\n') -j -J /tmp/test_progs.json
$ jq < /tmp/test_progs.json | head -n 30
{
"success": 29,
"success_subtest": 23,
"skipped": 3,
"failed": 28,
"results": [
{
"name": "bpf_cookie",
"number": 10,
"message": "test_bpf_cookie:PASS:skel_open 0 nsec\n",
"failed": true,
"subtests": [
{
"name": "multi_kprobe_link_api",
"number": 2,
"message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
"failed": true
},
{
"name": "multi_kprobe_attach_api",
"number": 3,
"message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
"failed": true
},
{
"name": "lsm",
"number": 8,
"message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n",
"failed": true
}
```
The file can then be used to print a summary of the test run and list of
failing tests/subtests:
```
$ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"'
Success: 29/23, Skipped: 3, Failed: 28
$ jq -r < /tmp/test_progs.json '.results | map([
if .failed then "#\(.number) \(.name)" else empty end,
(
. as {name: $tname, number: $tnum} | .subtests | map(
if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end
)
)
]) | flatten | .[]' | head -n 20
#10 bpf_cookie
#10/2 bpf_cookie/multi_kprobe_link_api
#10/3 bpf_cookie/multi_kprobe_attach_api
#10/8 bpf_cookie/lsm
#15 bpf_mod_race
#15/1 bpf_mod_race/ksym (used_btfs UAF)
#15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF)
#36 cgroup_hierarchical_stats
#61 deny_namespace
#61/1 deny_namespace/unpriv_userns_create_no_bpf
#73 fexit_stress
#83 get_func_ip_test
#99 kfunc_dynptr_param
#99/1 kfunc_dynptr_param/dynptr_data_null
#99/4 kfunc_dynptr_param/dynptr_data_null
#100 kprobe_multi_bench_attach
#100/1 kprobe_multi_bench_attach/kernel
#100/2 kprobe_multi_bench_attach/modules
#101 kprobe_multi_test
#101/1 kprobe_multi_test/skel_api
```
Signed-off-by: Manu Bretelle <chantr4@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230317163256.3809328-1-chantr4@gmail.com
Alexei Starovoitov says:
====================
From: Alexei Starovoitov <ast@kernel.org>
Allow BPF programs detect at load time whether particular kfunc exists.
Patch 1: Allow ld_imm64 to point to kfunc in the kernel.
Patch 2: Fix relocation of kfunc in ld_imm64 insn when kfunc is in kernel module.
Patch 3: Introduce bpf_ksym_exists() macro.
Patch 4: selftest.
NOTE: detection of kfuncs from light skeleton is not supported yet.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add load and run time test for bpf_ksym_exists() and check that the verifier
performs dead code elimination for non-existing kfunc.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230317201920.62030-5-alexei.starovoitov@gmail.com
Introduce bpf_ksym_exists() macro that can be used by BPF programs
to detect at load time whether particular ksym (either variable or kfunc)
is present in the kernel.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230317201920.62030-4-alexei.starovoitov@gmail.com
void *p = kfunc; -> generates ld_imm64 insn.
kfunc() -> generates bpf_call insn.
libbpf patches bpf_call insn correctly while only btf_id part of ld_imm64 is
set in the former case. Which means that pointers to kfuncs in modules are not
patched correctly and the verifier rejects load of such programs due to btf_id
being out of range. Fix libbpf to patch ld_imm64 for kfunc.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230317201920.62030-3-alexei.starovoitov@gmail.com
Allow ld_imm64 insn with BPF_PSEUDO_BTF_ID to hold the address of kfunc. The
ld_imm64 pointing to a valid kfunc will be seen as non-null PTR_TO_MEM by
is_branch_taken() logic of the verifier, while libbpf will resolve address to
unknown kfunc as ld_imm64 reg, 0 which will also be recognized by
is_branch_taken() and the verifier will proceed dead code elimination. BPF
programs can use this logic to detect at load time whether kfunc is present in
the kernel with bpf_ksym_exists() macro that is introduced in the next patches.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230317201920.62030-2-alexei.starovoitov@gmail.com
Commit d56b0c461d19da ("bpf, docs: Fix link to netdev-FAQ target")
attempts to fix linking problem to undefined "netdev-FAQ" label
introduced in 287f4fa99a5281 ("docs: Update references to netdev-FAQ")
by changing internal cross reference to netdev subsystem documentation
(Documentation/process/maintainer-netdev.rst) to external one at
docs.kernel.org. However, the linking problem is still not
resolved, as the generated link points to non-existent netdev-FAQ
section of the external doc, which when clicked, will instead going
to the top of the doc.
Revert back to internal linking by simply mention the doc path while
massaging the leading text to the link, since the netdev subsystem
doc contains no FAQs but rather general information about the subsystem.
Fixes: d56b0c461d19 ("bpf, docs: Fix link to netdev-FAQ target")
Fixes: 287f4fa99a52 ("docs: Update references to netdev-FAQ")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230314074449.23620-1-bagasdotme@gmail.com
Moving find_kallsyms_symbol_value from kernel/module/internal.h to
include/linux/module.h. The reason is that internal.h is not prepared to
be included when CONFIG_MODULES=n. find_kallsyms_symbol_value is used by
kernel/bpf/verifier.c and including internal.h from it (without modules)
leads into a compilation error:
In file included from ../include/linux/container_of.h:5,
from ../include/linux/list.h:5,
from ../include/linux/timer.h:5,
from ../include/linux/workqueue.h:9,
from ../include/linux/bpf.h:10,
from ../include/linux/bpf-cgroup.h:5,
from ../kernel/bpf/verifier.c:7:
../kernel/bpf/../module/internal.h: In function 'mod_find':
../include/linux/container_of.h:20:54: error: invalid use of undefined type 'struct module'
20 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~
[...]
This patch fixes the above error.
Fixes: 31bf1dbccfb0 ("bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/oe-kbuild-all/202303161404.OrmfCy09-lkp@intel.com/
Link: https://lore.kernel.org/bpf/20230317095601.386738-1-vmalik@redhat.com
Alexander Lobakin says:
====================
Enabling skb PP recycling revealed a couple issues in the bpf_test_run
code. Recycling broke the assumption that the headroom won't ever be
touched during the test_run execution: xdp_scrub_frame() invalidates the
XDP frame at the headroom start, while neigh xmit code overwrites 2 bytes
to the left of the Ethernet header. The first makes the kernel panic in
certain cases, while the second breaks xdp_do_redirect selftest on BE.
test_run is a limited-scope entity, so let's hope no more corner cases
will happen here or at least they will be as easy and pleasant to fix
as those two.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei noticed xdp_do_redirect test on BPF CI started failing on
BE systems after skb PP recycling was enabled:
test_xdp_do_redirect:PASS:prog_run 0 nsec
test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec
test_xdp_do_redirect:PASS:pkt_count_zero 0 nsec
test_xdp_do_redirect:FAIL:pkt_count_tc unexpected pkt_count_tc: actual
220 != expected 9998
test_max_pkt_size:PASS:prog_run_max_size 0 nsec
test_max_pkt_size:PASS:prog_run_too_big 0 nsec
close_netns:PASS:setns 0 nsec
#289 xdp_do_redirect:FAIL
Summary: 270/1674 PASSED, 30 SKIPPED, 1 FAILED
and it doesn't happen on LE systems.
Ilya then hunted it down to:
#0 0x0000000000aaeee6 in neigh_hh_output (hh=0x83258df0,
skb=0x88142200) at linux/include/net/neighbour.h:503
#1 0x0000000000ab2cda in neigh_output (skip_cache=false,
skb=0x88142200, n=<optimized out>) at linux/include/net/neighbour.h:544
#2 ip6_finish_output2 (net=net@entry=0x88edba00, sk=sk@entry=0x0,
skb=skb@entry=0x88142200) at linux/net/ipv6/ip6_output.c:134
#3 0x0000000000ab4cbc in __ip6_finish_output (skb=0x88142200, sk=0x0,
net=0x88edba00) at linux/net/ipv6/ip6_output.c:195
#4 ip6_finish_output (net=0x88edba00, sk=0x0, skb=0x88142200) at
linux/net/ipv6/ip6_output.c:206
xdp_do_redirect test places a u32 marker (0x42) right before the Ethernet
header to check it then in the XDP program and return %XDP_ABORTED if it's
not there. Neigh xmit code likes to round up hard header length to speed
up copying the header, so it overwrites two bytes in front of the Eth
header. On LE systems, 0x42 is one byte at `data - 4`, while on BE it's
`data - 1`, what explains why it happens only there.
It didn't happen previously due to that %XDP_PASS meant the page will be
discarded and replaced by a new one, but now it can be recycled as well,
while bpf_test_run code doesn't reinitialize the content of recycled
pages. This mark is limited to this particular test and its setup though,
so there's no need to predict 1000 different possible cases. Just move
it 4 bytes to the left, still keeping it 32 bit to match on more bytes.
Fixes: 9c94bbf9a87b ("xdp: recycle Page Pool backed skbs built from XDP frames")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+B_JOU+EpP=DKhbY9yXdN6GiRPnpTTXfEZ9sNkUeb-yQ@mail.gmail.com
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com> # + debugging
Link: https://lore.kernel.org/bpf/8341c1d9f935f410438e79d3bd8a9cc50aefe105.camel@linux.ibm.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230316175051.922550-3-aleksander.lobakin@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
syzbot and Ilya faced the splats when %XDP_PASS happens for bpf_test_run
after skb PP recycling was enabled for {__,}xdp_build_skb_from_frame():
BUG: kernel NULL pointer dereference, address: 0000000000000d28
RIP: 0010:memset_erms+0xd/0x20 arch/x86/lib/memset_64.S:66
[...]
Call Trace:
<TASK>
__finalize_skb_around net/core/skbuff.c:321 [inline]
__build_skb_around+0x232/0x3a0 net/core/skbuff.c:379
build_skb_around+0x32/0x290 net/core/skbuff.c:444
__xdp_build_skb_from_frame+0x121/0x760 net/core/xdp.c:622
xdp_recv_frames net/bpf/test_run.c:248 [inline]
xdp_test_run_batch net/bpf/test_run.c:334 [inline]
bpf_test_run_xdp_live+0x1289/0x1930 net/bpf/test_run.c:362
bpf_prog_test_run_xdp+0xa05/0x14e0 net/bpf/test_run.c:1418
[...]
This happens due to that it calls xdp_scrub_frame(), which nullifies
xdpf->data. bpf_test_run code doesn't reinit the frame when the XDP
program doesn't adjust head or tail. Previously, %XDP_PASS meant the
page will be released from the pool and returned to the MM layer, but
now it does return to the Pool with the nullified xdpf->data, which
doesn't get reinitialized then.
So, in addition to checking whether the head and/or tail have been
adjusted, check also for a potential XDP frame corruption. xdpf->data
is 100% affected and also xdpf->flags is the field closest to the
metadata / frame start. Checking for these two should be enough for
non-extreme cases.
Fixes: 9c94bbf9a87b ("xdp: recycle Page Pool backed skbs built from XDP frames")
Reported-by: syzbot+e1d1b65f7c32f2a86a9f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/bpf/000000000000f1985705f6ef2243@google.com
Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/e07dd94022ad5731705891b9487cc9ed66328b94.camel@linux.ibm.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230316175051.922550-2-aleksander.lobakin@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals()
ensures that the resulting pointer has a constant offset if
bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds()
which in turn calls check_stack_access_for_ptr_arithmetic(). There,
-EACCESS is returned if the register's offset is not constant, thereby
rejecting the program.
In summary, an unprivileged user must never be able to create stack
pointers with a variable offset. That is also the case, because a
respective check in check_stack_write() is missing. If they were able
to create a variable-offset pointer, users could still use it in a
stack-write operation to trigger unsafe speculative behavior [1].
Because unprivileged users must already be prevented from creating
variable-offset stack pointers, viable options are to either remove
this check (replacing it with a clarifying comment), or to turn it
into a "verifier BUG"-message, also adding a similar check in
check_stack_write() (for consistency, as a second-level defense).
This patch implements the first option to reduce verifier bloat.
This check was introduced by commit 01f810ace9ed ("bpf: Allow
variable-offset stack access") which correctly notes that
"variable-offset reads and writes are disallowed (they were already
disallowed for the indirect access case) because the speculative
execution checking code doesn't support them". However, it does not
further discuss why the check in check_stack_read() is necessary.
The code which made this check obsolete was also introduced in this
commit.
I have compiled ~650 programs from the Linux selftests, Linux samples,
Cilium, and libbpf/examples projects and confirmed that none of these
trigger the check in check_stack_read() [2]. Instead, all of these
programs are, as expected, already rejected when constructing the
variable-offset pointers. Note that the check in
check_stack_access_for_ptr_arithmetic() also prints "off=%d" while the
code removed by this patch does not (the error removed does not appear
in the "verification_error" values). For reproducibility, the
repository linked includes the raw data and scripts used to create
the plot.
[1] https://arxiv.org/pdf/1807.03757.pdf
[2] 53dc19fcf4/data/plots/23-02-26_23-56_bpftool/bpftool/0004-errors.pdf
Fixes: 01f810ace9ed ("bpf: Allow variable-offset stack access")
Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230315165358.23701-1-gerhorst@cs.fau.de
David Vernet says:
====================
The struct bpf_cpumask type is currently not RCU safe. It uses the
bpf_mem_cache_{alloc,free}() APIs to allocate and release cpumasks, and
those allocations may be reused before an RCU grace period has elapsed.
We want to be able to enable using this pattern in BPF programs:
private(MASK) static struct bpf_cpumask __kptr *global;
int BPF_PROG(prog, ...)
{
struct bpf_cpumask *cpumask;
bpf_rcu_read_lock();
cpumask = global;
if (!cpumask) {
bpf_rcu_read_unlock();
return -1;
}
bpf_cpumask_setall(cpumask);
...
bpf_rcu_read_unlock();
}
In other words, to be able to pass a kptr to KF_RCU bpf_cpumask kfuncs
without requiring the acquisition and release of refcounts using
bpf_cpumask_kptr_get(). This patchset enables this by making the struct
bpf_cpumask type RCU safe, and removing the bpf_cpumask_kptr_get()
function.
---
v1: https://lore.kernel.org/all/20230316014122.678082-2-void@manifault.com/
Changelog:
----------
v1 -> v2:
- Add doxygen comment for new @rcu field in struct bpf_cpumask.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Now that the kfunc no longer exists, we can remove it and instead
describe how RCU can be used to get a struct bpf_cpumask from a map
value. This patch updates the BPF documentation accordingly.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-6-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Now that struct bpf_cpumask is RCU safe, there's no need for this kfunc.
Rather than doing the following:
private(MASK) static struct bpf_cpumask __kptr *global;
int BPF_PROG(prog, s32 cpu, ...)
{
struct bpf_cpumask *cpumask;
bpf_rcu_read_lock();
cpumask = bpf_cpumask_kptr_get(&global);
if (!cpumask) {
bpf_rcu_read_unlock();
return -1;
}
bpf_cpumask_setall(cpumask);
...
bpf_cpumask_release(cpumask);
bpf_rcu_read_unlock();
}
Programs can instead simply do (assume same global cpumask):
int BPF_PROG(prog, ...)
{
struct bpf_cpumask *cpumask;
bpf_rcu_read_lock();
cpumask = global;
if (!cpumask) {
bpf_rcu_read_unlock();
return -1;
}
bpf_cpumask_setall(cpumask);
...
bpf_rcu_read_unlock();
}
In other words, no extra atomic acquire / release, and less boilerplate
code.
This patch removes both the kfunc, as well as its selftests and
documentation.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-5-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Now that struct bpf_cpumask * is considered an RCU-safe type according
to the verifier, we should add tests that validate its common usages.
This patch adds those tests to the cpumask test suite. A subsequent
changes will remove bpf_cpumask_kptr_get(), and will adjust the selftest
and BPF documentation accordingly.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
struct bpf_cpumask is a BPF-wrapper around the struct cpumask type which
can be instantiated by a BPF program, and then queried as a cpumask in
similar fashion to normal kernel code. The previous patch in this series
makes the type fully RCU safe, so the type can be included in the
rcu_protected_type BTF ID list.
A subsequent patch will remove bpf_cpumask_kptr_get(), as it's no longer
useful now that we can just treat the type as RCU safe by default and do
our own if check.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The struct bpf_cpumask type uses the bpf_mem_cache_{alloc,free}() APIs
to allocate and free its cpumasks. The bpf_mem allocator may currently
immediately reuse some memory when its freed, without waiting for an RCU
read cycle to elapse. We want to be able to treat struct bpf_cpumask
objects as completely RCU safe.
This is necessary for two reasons:
1. bpf_cpumask_kptr_get() currently does an RCU-protected
refcnt_inc_not_zero(). This of course assumes that the underlying
memory is not reused, and is therefore unsafe in its current form.
2. We want to be able to get rid of bpf_cpumask_kptr_get() entirely, and
intead use the superior kptr RCU semantics now afforded by the
verifier.
This patch fixes (1), and enables (2), by making struct bpf_cpumask RCU
safe. A subsequent patch will update the verifier to allow struct
bpf_cpumask * pointers to be passed to KF_RCU kfuncs, and then a latter
patch will remove bpf_cpumask_kptr_get().
Fixes: 516f4d3397c9 ("bpf: Enable cpumasks to be queried and used as kptrs")
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230316054028.88924-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some consumers of libbpf compile the code base with different warnings
enabled. In a report for perf, for example, -Wpacked was set which
caused warnings about "inefficient alignment" to be emitted on a subset
of supported architectures.
With this change we silence specifically those warnings, as we intentionally
worked with packed structs.
This is a similar resolution as in b2f10cd4e805 ("perf cpumap: Fix alignment
for masks in event encoding").
Fixes: 1eebcb60633f ("libbpf: Implement basic zip archive parsing support")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/bpf/CA+G9fYtBnwxAWXi2+GyNByApxnf_DtP1-6+_zOKAdJKnJBexjg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230315171550.1551603-1-deso@posteo.net
In __start_server, it leaks a fd when setsockopt(SO_REUSEPORT) fails.
This patch fixes it.
Fixes: eed92afdd14c ("bpf: selftest: Test batching and bpf_(get|set)sockopt in bpf tcp iter")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230316000726.1016773-2-martin.lau@linux.dev
In tcp_hdr_options test, it ensures the received tcp hdr option
and the sk local storage have the expected values. It uses memcmp
to check that. Testing the memcmp result with ASSERT_OK is confusing
because ASSERT_OK will print out the errno which is not set.
This patch uses ASSERT_EQ to check for 0 instead.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230316000726.1016773-1-martin.lau@linux.dev
Viktor Malik says:
====================
I noticed that the verifier behaves incorrectly when attaching to fentry
of multiple functions of the same name located in different modules (or
in vmlinux). The reason for this is that if the target program is not
specified, the verifier will search kallsyms for the trampoline address
to attach to. The entire kallsyms is always searched, not respecting the
module in which the function to attach to is located.
As Yonghong correctly pointed out, there is yet another issue - the
trampoline acquires the module reference in register_fentry which means
that if the module is unloaded between the place where the address is
found in the verifier and register_fentry, it is possible that another
module is loaded to the same address in the meantime, which may lead to
errors.
This patch fixes the above issues by extracting the module name from the
BTF of the attachment target (which must be specified) and by doing the
search in kallsyms of the correct module. At the same time, the module
reference is acquired right after the address is found and only released
right before the program itself is unloaded.
---
Changes in v10:
- added the new test to DENYLIST.aarch64 (suggested by Andrii)
- renamed the test source file to match the test name
Changes in v9:
- two small changes suggested by Jiri Olsa and Jiri's ack
Changes in v8:
- added module_put to error paths in bpf_check_attach_target after the
module reference is acquired
Changes in v7:
- refactored the module reference manipulation (comments by Jiri Olsa)
- cleaned up the test (comments by Andrii Nakryiko)
Changes in v6:
- storing the module reference inside bpf_prog_aux instead of
bpf_trampoline and releasing it when the program is unloaded
(suggested by Jiri Olsa)
Changes in v5:
- fixed acquiring and releasing of module references by trampolines to
prevent modules being unloaded between address lookup and trampoline
allocation
Changes in v4:
- reworked module kallsyms lookup approach using existing functions,
verifier now calls btf_try_get_module to retrieve the module and
find_kallsyms_symbol_value to get the symbol address (suggested by
Alexei)
- included Jiri Olsa's comments
- improved description of the new test and added it as a comment into
the test source
Changes in v3:
- added trivial implementation for kallsyms_lookup_name_in_module() for
!CONFIG_MODULES (noticed by test robot, fix suggested by Hao Luo)
Changes in v2:
- introduced and used more space-efficient kallsyms lookup function,
suggested by Jiri Olsa
- included Hao Luo's comments
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adds a new test that tries to attach a program to fentry of two
functions of the same name, one located in vmlinux and the other in
bpf_testmod.
To avoid conflicts with existing tests, a new function
"bpf_fentry_shadow_test" was created both in vmlinux and in bpf_testmod.
The previous commit fixed a bug which caused this test to fail. The
verifier would always use the vmlinux function's address as the target
trampoline address, hence trying to create two trampolines for a single
address, which is forbidden.
The test (similarly to other fentry/fexit tests) is not working on arm64
at the moment.
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/5fe2f364190b6f79b085066ed7c5989c5bc475fa.1678432753.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This resolves two problems with attachment of fentry/fexit/fmod_ret/lsm
to functions located in modules:
1. The verifier tries to find the address to attach to in kallsyms. This
is always done by searching the entire kallsyms, not respecting the
module in which the function is located. Such approach causes an
incorrect attachment address to be computed if the function to attach
to is shadowed by a function of the same name located earlier in
kallsyms.
2. If the address to attach to is located in a module, the module
reference is only acquired in register_fentry. If the module is
unloaded between the place where the address is found
(bpf_check_attach_target in the verifier) and register_fentry, it is
possible that another module is loaded to the same address which may
lead to potential errors.
Since the attachment must contain the BTF of the program to attach to,
we extract the module from it and search for the function address in the
correct module (resolving problem no. 1). Then, the module reference is
taken directly in bpf_check_attach_target and stored in the bpf program
(in bpf_prog_aux). The reference is only released when the program is
unloaded (resolving problem no. 2).
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/3f6a9d8ae850532b5ef864ef16327b0f7a669063.1678432753.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The commit 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc") added
bpf_cgroup_from_id() which calls current_cgns_cgroup_dfl() through
cgroup_get_from_id(). However, BPF programs may be attached to a point where
current->nsproxy has already been cleared to NULL by exit_task_namespace()
and calling bpf_cgroup_from_id() would cause an oops.
Just return the system-wide root if nsproxy has been cleared. This allows
all cgroups to be looked up after the task passed through
exit_task_namespace(), which semantically makes sense. Given that the only
way to get this behavior is through BPF programs, it seems safe but let's
see what others think.
Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/ZBDuVWiFj2jiz3i8@slm.duckdns.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
LLVM commit https://reviews.llvm.org/D143726 introduced hoistMinMax optimization
that transformed
(i < VIRTIO_MAX_SGS) && (i < out_sgs)
into
i < MIN(VIRTIO_MAX_SGS, out_sgs)
and caused the verifier to stop recognizing such loop as bounded.
Which resulted in the following test failure:
libbpf: prog 'trace_virtqueue_add_sgs': BPF program load failed: Bad address
libbpf: prog 'trace_virtqueue_add_sgs': -- BEGIN PROG LOAD LOG --
The sequence of 8193 jumps is too complex.
verification time 789206 usec
stack depth 56
processed 156446 insns (limit 1000000) max_states_per_insn 7 total_states 1746 peak_states 1701 mark_read 12
-- END PROG LOAD LOG --
libbpf: prog 'trace_virtqueue_add_sgs': failed to load: -14
libbpf: failed to load object 'loop6.bpf.o'
Workaround the verifier limitation for now with inline asm that
prevents this particular optimization.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexander Lobakin says:
====================
Yeah, I still remember that "Who needs cpumap nowadays" (c), but anyway.
__xdp_build_skb_from_frame() missed the moment when the networking stack
became able to recycle skb pages backed by a page_pool. This was making
e.g. cpumap redirect even less effective than simple %XDP_PASS. veth was
also affected in some scenarios.
A lot of drivers use skb_mark_for_recycle() already, it's been almost
two years and seems like there are no issues in using it in the generic
code too. {__,}xdp_release_frame() can be then removed as it losts its
last user.
Page Pool becomes then zero-alloc (or almost) in the abovementioned
cases, too. Other memory type models (who needs them at this point)
have no changes.
Some numbers on 1 Xeon Platinum core bombed with 27 Mpps of 64-byte
IPv6 UDP, iavf w/XDP[0] (CONFIG_PAGE_POOL_STATS is enabled):
Plain %XDP_PASS on baseline, Page Pool driver:
src cpu Rx drops dst cpu Rx
2.1 Mpps N/A 2.1 Mpps
cpumap redirect (cross-core, w/o leaving its NUMA node) on baseline:
6.8 Mpps 5.0 Mpps 1.8 Mpps
cpumap redirect with skb PP recycling:
7.9 Mpps 5.7 Mpps 2.2 Mpps
+22% (from cpumap redir on baseline)
[0] https://github.com/alobakin/linux/commits/iavf-xdp
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__xdp_build_skb_from_frame() was the last user of
{__,}xdp_release_frame(), which detaches pages from the page_pool.
All the consumers now recycle Page Pool skbs and page, except mlx5,
stmmac and tsnep drivers, which use page_pool_release_page() directly
(might change one day). It's safe to assume this functionality is not
needed anymore and can be removed (in favor of recycling).
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230313215553.1045175-5-aleksander.lobakin@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__xdp_build_skb_from_frame() state(d):
/* Until page_pool get SKB return path, release DMA here */
Page Pool got skb pages recycling in April 2021, but missed this
function.
xdp_release_frame() is relevant only for Page Pool backed frames and it
detaches the page from the corresponding page_pool in order to make it
freeable via page_frag_free(). It can instead just mark the output skb
as eligible for recycling if the frame is backed by a pp. No change for
other memory model types (the same condition check as before).
cpumap redirect and veth on Page Pool drivers now become zero-alloc (or
almost).
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230313215553.1045175-4-aleksander.lobakin@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
skb_mark_for_recycle() is guarded with CONFIG_PAGE_POOL, this creates
unneeded complication when using it in the generic code. For now, it's
only used in the drivers always selecting Page Pool, so this works.
Move the guards so that preprocessor will cut out only the operation
itself and the function will still be a noop on !PAGE_POOL systems,
but available there as well.
No functional changes.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202303020342.Wi2PRFFH-lkp@intel.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230313215553.1045175-3-aleksander.lobakin@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, the test relies on that only dropped ("xmitted") frames will
be recycled and if a frame became an skb, it will be freed later by the
stack and never come back to its page_pool.
So, it easily gets broken by trying to recycle skbs[0]:
test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec
test_xdp_do_redirect:FAIL:pkt_count_zero unexpected pkt_count_zero:
actual 9936 != expected 2
test_xdp_do_redirect:PASS:pkt_count_tc 0 nsec
That huge mismatch happened because after the TC ingress hook zeroes the
magic, the page gets recycled when skb is freed, not returned to the MM
layer. "Live frames" mode initializes only new pages and keeps the
recycled ones as is by design, so they appear with zeroed magic on the
Rx path again.
Expand the possible magic values from two: 0 (was "xmitted"/dropped or
did hit the TC hook) and 0x42 (hit the input XDP prog) to three: the new
one will mark frames hit the TC hook, so that they will elide both
@pkt_count_zero and @pkt_count_xdp. They can then be recycled to their
page_pool or returned to the page allocator, this won't affect the
counters anyhow. Just make sure to mark them as "input" (0x42) when they
appear on the Rx path again.
Also make an enum from those magics, so that they will be always visible
and can be changed in just one place anytime. This also eases adding any
new marks later on.
Link: https://github.com/kernel-patches/bpf/actions/runs/4386538411/jobs/7681081789
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230313215553.1045175-2-aleksander.lobakin@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The verifier rejects the code:
bpf_strncmp(task->comm, 16, "my_task");
with the message:
16: (85) call bpf_strncmp#182
R1 type=trusted_ptr_ expected=fp, pkt, pkt_meta, map_key, map_value, mem, ringbuf_mem, buf
Teach the verifier that such access pattern is safe.
Do not allow untrusted and legacy ptr_to_btf_id to be passed into helpers.
Reported-by: David Vernet <void@manifault.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230313235845.61029-3-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
bpf_strncmp() doesn't write into its first argument.
Make sure that the verifier knows about it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230313235845.61029-2-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>