We need to protect the reader reading the sysctl value because the
value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to protect the reader reading the sysctl value because the
value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to protect the reader reading the sysctl value because the
value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to protect the reader reading the sysctl value because the
value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to protect the reader reading the sysctl value because the
value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to protect the reader reading the sysctl value because the
value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to protect the reader reading the sysctl value
because the value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need to protect the reader reading sysctl_netrom_default_path_quality
because the value can be changed concurrently.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmXoPHUACgkQrB3Eaf9P
W7ehmxAAoemzwIDP0wcDi7U68Za7wBC7CbV6WoVmDNRsO+BwnqlCtd7+B9hi0Qd0
h+KCVYw5EbUJbsHcuefj/QMNO46ueZZLswRIEMKlkZOHdC8TTGzjYmjLnkOHKTCm
wpJ9QSrnBoy3MUcWbZCJh4BZXsTftbu1fHRWy9GdBERXYfHqWdQCq/ZMAgv3IwLF
KwZahoGZwCDkmWOpshbBRGj0lnONzZ3mW//bN5EB71rSi33gPEtABYBSw9E9sdMw
uZg/xRnHMhS5CQHRnFEqVUiqu3wDJYgs3kQIDFhC1T2w94GBF/R+HzzFiBKLzxr1
Dk17avoNexSYRThJfCk6fMbXT4GVaUSKSG6KI4CRLna/wAIb4QEVDPEdk1ybOxRy
eoUfo7GXkVqhJpqnOX0Sl3262DnhQ/syhmv3sWXmoSpa630mDuFleVuVJE81dtMu
jSfaXY7BNpEwTwj8kzabKq5cLkt4T4dAnXf0ao1ATNCzcFkUjSIWN5ylnOEtdVe/
wEZYp8oc1kPEDU0RC8LzpaEooTQlPceeIAZca7a/lAhnRreGyP9j56Y2oM9YS76o
JPRrznWAdXVa0gX81zGOjZDHWSyLgpJZ9z5MaAdquP9uhXS4buu+pxxoUcFEfLJd
FV+yY9j3o3l0HLL6j/y6EZy/QzWPV+K9NAhokNL/WTwuOZ2Lq5Y=
=FnLp
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2024-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2024-03-06
1) Clear the ECN bits flowi4_tos in decode_session4().
This was already fixed but the bug was reintroduced
when decode_session4() switched to us the flow dissector.
From Guillaume Nault.
2) Fix UDP encapsulation in the TX path with packet offload mode.
From Leon Romanovsky,
3) Avoid clang fortify warning in copy_to_user_tmpl().
From Nathan Chancellor.
4) Fix inter address family tunnel in packet offload mode.
From Mike Yu.
* tag 'ipsec-2024-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: set skb control buffer based on packet offload as well
xfrm: fix xfrm child route lookup for packet offload
xfrm: Avoid clang fortify warning in copy_to_user_tmpl()
xfrm: Pass UDP encapsulation in TX packet offload
xfrm: Clear low order bits of ->flowi4_tos in decode_session4().
====================
Link: https://lore.kernel.org/r/20240306100438.3953516-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZejh3gAKCRDbK58LschI
g80KAQCJ0i1CffuizsxQcvC+xrtdZaz1D6L9oE6qJc999zDmKAD/QoipUiWTYKEW
KMrX1qoH3LBixoMFEZjSJY6kJWQV3gQ=
=Gf2l
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2024-03-06
We've added 5 non-merge commits during the last 1 day(s) which contain
a total of 5 files changed, 77 insertions(+), 4 deletions(-).
The main changes are:
1) Fix BPF verifier to check bpf_func_state->callback_depth when pruning
states as otherwise unsafe programs could get accepted,
from Eduard Zingerman.
2) Fix to zero-initialise xdp_rxq_info struct before running XDP program in
CPU map which led to random xdp_md fields, from Toke Høiland-Jørgensen.
3) Fix bonding XDP feature flags calculation when bonding device has no
slave devices anymore, from Daniel Borkmann.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
cpumap: Zero-initialise xdp_rxq_info struct before running XDP program
selftests/bpf: Fix up xdp bonding test wrt feature flags
xdp, bonding: Fix feature flags when there are no slave devs anymore
selftests/bpf: test case for callback_depth states pruning logic
bpf: check bpf_func_state->callback_depth when pruning states
====================
Link: https://lore.kernel.org/r/20240306220309.13534-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If connection isn't established yet, get_mr() will fail, trigger connection after
get_mr().
Fixes: 584a8279a44a ("RDS: RDMA: return appropriate error on rdma map failures")
Reported-and-tested-by: syzbot+d4faee732755bba9838e@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-03-05 (idpf, ice, i40e, igc, e1000e)
This series contains updates to idpf, ice, i40e, igc and e1000e drivers.
Emil disables local BH on NAPI schedule for proper handling of softirqs
on idpf.
Jake stops reporting of virtchannel RSS option which in unsupported on
ice.
Rand Deeb adds null check to prevent possible null pointer dereference
on ice.
Michal Schmidt moves DPLL mutex initialization to resolve uninitialized
mutex usage for ice.
Jesse fixes incorrect variable usage for calculating Tx stats on ice.
Ivan Vecera corrects logic for firmware equals check on i40e.
Florian Kauer prevents memory corruption for XDP_REDIRECT on igc.
Sasha reverts an incorrect use of FIELD_GET which caused a regression
for Wake on LAN on e1000e.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Yu says:
====================
In the XFRM stack, whether a packet is forwarded to the IPv4
or IPv6 stack depends on the family field of the matched SA.
This does not completely work for IPsec packet offload in some
scenario, for example, sending an IPv6 packet that will be
encrypted and encapsulated as an IPv4 packet in HW.
Here are the patches to make IPsec packet offload work on the
mentioned scenario.
====================
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This bug was noticed while re-implementing parts of the kernel
driver in userspace using spidev. The goal was to enable some
of the errata workarounds that Microchip describes in their
errata sheet [1].
Both the errata sheet and the regular datasheet of e.g. the KSZ8795
imply that you need to do this for indirect register accesses:
- write a 16-bit value to a control register pair (this value
consists of the indirect register table, and the offset inside
the table)
- either read or write an 8-bit value from the data storage
register (indicated by REG_IND_BYTE in the kernel)
The current implementation has the order swapped. It can be
proven, by reading back some indirect register with known content
(the EEE register modified in ksz8_handle_global_errata() is one of
these), that this implementation does not work.
Private discussion with Oleksij Rempel of Pengutronix has revealed
that the workaround was apparantly never tested on actual hardware.
[1] https://ww1.microchip.com/downloads/aemDocuments/documents/OTH/ProductDocuments/Errata/KSZ87xx-Errata-DS80000687C.pdf
Signed-off-by: Tobias Jakobi (Compleo) <tobias.jakobi.compleo@gmail.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Fixes: 7b6e6235b664 ("net: dsa: microchip: ksz8795: handle eee specif erratum")
Link: https://lore.kernel.org/r/20240304154135.161332-1-tobias.jakobi.compleo@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Older versions of GCC really want to know the full definition
of the type involved in rcu_assign_pointer().
struct dpll_pin is defined in a local header, net/core can't
reach it. Move all the netdev <> dpll code into dpll, where
the type is known. Otherwise we'd need multiple function calls
to jump between the compilation units.
This is the same problem the commit under fixes was trying to address,
but with rcu_assign_pointer() not rcu_dereference().
Some of the exports are not needed, networking core can't
be a module, we only need exports for the helpers used by
drivers.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/
Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When running an XDP program that is attached to a cpumap entry, we don't
initialise the xdp_rxq_info data structure being used in the xdp_buff
that backs the XDP program invocation. Tobias noticed that this leads to
random values being returned as the xdp_md->rx_queue_index value for XDP
programs running in a cpumap.
This means we're basically returning the contents of the uninitialised
memory, which is bad. Fix this by zero-initialising the rxq data
structure before running the XDP program.
Fixes: 9216477449f3 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap")
Reported-by: Tobias Böhm <tobias@aibor.de>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240305213132.11955-1-toke@redhat.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Commit 9b0ed890ac2a ("bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPY")
changed the driver from reporting everything as supported before a device
was bonded into having the driver report that no XDP feature is supported
until a real device is bonded as it seems to be more truthful given
eventually real underlying devices decide what XDP features are supported.
The change however did not take into account when all slave devices get
removed from the bond device. In this case after 9b0ed890ac2a, the driver
keeps reporting a feature mask of 0x77, that is, NETDEV_XDP_ACT_MASK &
~NETDEV_XDP_ACT_XSK_ZEROCOPY whereas it should have reported a feature
mask of 0.
Fix it by resetting XDP feature flags in the same way as if no XDP program
is attached to the bond device. This was uncovered by the XDP bond selftest
which let BPF CI fail. After adjusting the starting masks on the latter
to 0 instead of NETDEV_XDP_ACT_MASK the test passes again together with
this fix.
Fixes: 9b0ed890ac2a ("bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPY")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Prashant Batra <prbatra.mail@gmail.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Message-ID: <20240305090829.17131-1-daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman says:
====================
check bpf_func_state->callback_depth when pruning states
This patch-set fixes bug in states pruning logic hit in mailing list
discussion [0]. The details of the fix are in patch #1.
The main idea for the fix belongs to Yonghong Song,
mine contribution is merely in review and test cases.
There are some changes in verification performance:
File Program Insns (DIFF) States (DIFF)
------------------------- ------------- --------------- --------------
pyperf600_bpf_loop.bpf.o on_event +15 (+0.42%) +0 (+0.00%)
strobemeta_bpf_loop.bpf.o on_event +857 (+37.95%) +60 (+38.96%)
xdp_synproxy_kern.bpf.o syncookie_tc +2892 (+30.39%) +109 (+36.33%)
xdp_synproxy_kern.bpf.o syncookie_xdp +2892 (+30.01%) +109 (+36.09%)
(when tested on a subset of selftests identified by
selftests/bpf/veristat.cfg and Cilium bpf object files from [4])
Changelog:
v2 [2] -> v3:
- fixes for verifier.c commit message as suggested by Yonghong;
- patch-set re-rerouted to 'bpf' tree as suggested in [2];
- patch for test_tcp_custom_syncookie is sent separately to 'bpf-next' [3].
- veristat results updated using 'bpf' tree as baseline and clang 16.
v1 [1] -> v2:
- patch #2 commit message updated to better reflect verifier behavior
with regards to checkpoints tree (suggested by Yonghong);
- veristat results added (suggested by Andrii).
[0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/
[1] https://lore.kernel.org/bpf/20240212143832.28838-1-eddyz87@gmail.com/
[2] https://lore.kernel.org/bpf/20240216150334.31937-1-eddyz87@gmail.com/
[3] https://lore.kernel.org/bpf/20240222150300.14909-1-eddyz87@gmail.com/
[4] https://github.com/anakryiko/cilium
====================
Link: https://lore.kernel.org/r/20240222154121.6991-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The test case was minimized from mailing list discussion [0].
It is equivalent to the following C program:
struct iter_limit_bug_ctx { __u64 a; __u64 b; __u64 c; };
static __naked void iter_limit_bug_cb(void)
{
switch (bpf_get_prandom_u32()) {
case 1: ctx->a = 42; break;
case 2: ctx->b = 42; break;
default: ctx->c = 42; break;
}
}
int iter_limit_bug(struct __sk_buff *skb)
{
struct iter_limit_bug_ctx ctx = { 7, 7, 7 };
bpf_loop(2, iter_limit_bug_cb, &ctx, 0);
if (ctx.a == 42 && ctx.b == 42 && ctx.c == 7)
asm volatile("r1 /= 0;":::"r1");
return 0;
}
The main idea is that each loop iteration changes one of the state
variables in a non-deterministic manner. Hence it is premature to
prune the states that have two iterations left comparing them to
states with one iteration left.
E.g. {{7,7,7}, callback_depth=0} can reach state {42,42,7},
while {{7,7,7}, callback_depth=1} can't.
[0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240222154121.6991-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When comparing current and cached states verifier should consider
bpf_func_state->callback_depth. Current state cannot be pruned against
cached state, when current states has more iterations left compared to
cached state. Current state has more iterations left when it's
callback_depth is smaller.
Below is an example illustrating this bug, minimized from mailing list
discussion [0] (assume that BPF_F_TEST_STATE_FREQ is set).
The example is not a safe program: if loop_cb point (1) is followed by
loop_cb point (2), then division by zero is possible at point (4).
struct ctx {
__u64 a;
__u64 b;
__u64 c;
};
static void loop_cb(int i, struct ctx *ctx)
{
/* assume that generated code is "fallthrough-first":
* if ... == 1 goto
* if ... == 2 goto
* <default>
*/
switch (bpf_get_prandom_u32()) {
case 1: /* 1 */ ctx->a = 42; return 0; break;
case 2: /* 2 */ ctx->b = 42; return 0; break;
default: /* 3 */ ctx->c = 42; return 0; break;
}
}
SEC("tc")
__failure
__flag(BPF_F_TEST_STATE_FREQ)
int test(struct __sk_buff *skb)
{
struct ctx ctx = { 7, 7, 7 };
bpf_loop(2, loop_cb, &ctx, 0); /* 0 */
/* assume generated checks are in-order: .a first */
if (ctx.a == 42 && ctx.b == 42 && ctx.c == 7)
asm volatile("r0 /= 0;":::"r0"); /* 4 */
return 0;
}
Prior to this commit verifier built the following checkpoint tree for
this example:
.------------------------------------- Checkpoint / State name
| .-------------------------------- Code point number
| | .---------------------------- Stack state {ctx.a,ctx.b,ctx.c}
| | | .------------------- Callback depth in frame #0
v v v v
- (0) {7P,7P,7},depth=0
- (3) {7P,7P,7},depth=1
- (0) {7P,7P,42},depth=1
- (3) {7P,7,42},depth=2
- (0) {7P,7,42},depth=2 loop terminates because of depth limit
- (4) {7P,7,42},depth=0 predicted false, ctx.a marked precise
- (6) exit
(a) - (2) {7P,7,42},depth=2
- (0) {7P,42,42},depth=2 loop terminates because of depth limit
- (4) {7P,42,42},depth=0 predicted false, ctx.a marked precise
- (6) exit
(b) - (1) {7P,7P,42},depth=2
- (0) {42P,7P,42},depth=2 loop terminates because of depth limit
- (4) {42P,7P,42},depth=0 predicted false, ctx.{a,b} marked precise
- (6) exit
- (2) {7P,7,7},depth=1 considered safe, pruned using checkpoint (a)
(c) - (1) {7P,7P,7},depth=1 considered safe, pruned using checkpoint (b)
Here checkpoint (b) has callback_depth of 2, meaning that it would
never reach state {42,42,7}.
While checkpoint (c) has callback_depth of 1, and thus
could yet explore the state {42,42,7} if not pruned prematurely.
This commit makes forbids such premature pruning,
allowing verifier to explore states sub-tree starting at (c):
(c) - (1) {7,7,7P},depth=1
- (0) {42P,7,7P},depth=1
...
- (2) {42,7,7},depth=2
- (0) {42,42,7},depth=2 loop terminates because of depth limit
- (4) {42,42,7},depth=0 predicted true, ctx.{a,b,c} marked precise
- (5) division by zero
[0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/
Fixes: bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations")
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240222154121.6991-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Refactoring of the field get conversion introduced a regression in the
legacy Wake On Lan from a magic packet with i219 devices. Rx address
copied not correctly from MAC to PHY with FIELD_GET macro.
Fixes: b9a452545075 ("intel: legacy: field get conversion")
Suggested-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When a frame can not be transmitted in XDP_REDIRECT
(e.g. due to a full queue), it is necessary to free
it by calling xdp_return_frame_rx_napi.
However, this is the responsibility of the caller of
the ndo_xdp_xmit (see for example bq_xmit_all in
kernel/bpf/devmap.c) and thus calling it inside
igc_xdp_xmit (which is the ndo_xdp_xmit of the igc
driver) as well will lead to memory corruption.
In fact, bq_xmit_all expects that it can return all
frames after the last successfully transmitted one.
Therefore, break for the first not transmitted frame,
but do not call xdp_return_frame_rx_napi in igc_xdp_xmit.
This is equally implemented in other Intel drivers
such as the igb.
There are two alternatives to this that were rejected:
1. Return num_frames as all the frames would have been
transmitted and release them inside igc_xdp_xmit.
While it might work technically, it is not what
the return value is meant to represent (i.e. the
number of SUCCESSFULLY transmitted packets).
2. Rework kernel/bpf/devmap.c and all drivers to
support non-consecutively dropped packets.
Besides being complex, it likely has a negative
performance impact without a significant gain
since it is anyway unlikely that the next frame
can be transmitted if the previous one was dropped.
The memory corruption can be reproduced with
the following script which leads to a kernel panic
after a few seconds. It basically generates more
traffic than a i225 NIC can transmit and pushes it
via XDP_REDIRECT from a virtual interface to the
physical interface where frames get dropped.
#!/bin/bash
INTERFACE=enp4s0
INTERFACE_IDX=`cat /sys/class/net/$INTERFACE/ifindex`
sudo ip link add dev veth1 type veth peer name veth2
sudo ip link set up $INTERFACE
sudo ip link set up veth1
sudo ip link set up veth2
cat << EOF > redirect.bpf.c
SEC("prog")
int redirect(struct xdp_md *ctx)
{
return bpf_redirect($INTERFACE_IDX, 0);
}
char _license[] SEC("license") = "GPL";
EOF
clang -O2 -g -Wall -target bpf -c redirect.bpf.c -o redirect.bpf.o
sudo ip link set veth2 xdp obj redirect.bpf.o
cat << EOF > pass.bpf.c
SEC("prog")
int pass(struct xdp_md *ctx)
{
return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
EOF
clang -O2 -g -Wall -target bpf -c pass.bpf.c -o pass.bpf.o
sudo ip link set $INTERFACE xdp obj pass.bpf.o
cat << EOF > trafgen.cfg
{
/* Ethernet Header */
0xe8, 0x6a, 0x64, 0x41, 0xbf, 0x46,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
const16(ETH_P_IP),
/* IPv4 Header */
0b01000101, 0, # IPv4 version, IHL, TOS
const16(1028), # IPv4 total length (UDP length + 20 bytes (IP header))
const16(2), # IPv4 ident
0b01000000, 0, # IPv4 flags, fragmentation off
64, # IPv4 TTL
17, # Protocol UDP
csumip(14, 33), # IPv4 checksum
/* UDP Header */
10, 0, 1, 1, # IP Src - adapt as needed
10, 0, 1, 2, # IP Dest - adapt as needed
const16(6666), # UDP Src Port
const16(6666), # UDP Dest Port
const16(1008), # UDP length (UDP header 8 bytes + payload length)
csumudp(14, 34), # UDP checksum
/* Payload */
fill('W', 1000),
}
EOF
sudo trafgen -i trafgen.cfg -b3000MB -o veth1 --cpp
Fixes: 4ff320361092 ("igc: Add support for XDP_REDIRECT action")
Signed-off-by: Florian Kauer <florian.kauer@linutronix.de>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Helper i40e_is_fw_ver_eq() compares incorrectly given firmware version
as it returns true when the major version of running firmware is
greater than the given major version that is wrong and results in
failure during getting of DCB configuration where this helper is used.
Fix the check and return true only if the running FW version is exactly
equals to the given version.
Reproducer:
1. Load i40e driver
2. Check dmesg output
[root@host ~]# modprobe i40e
[root@host ~]# dmesg | grep 'i40e.*DCB'
[ 74.750642] i40e 0000:02:00.0: Query for DCB configuration failed, err -EIO aq_err I40E_AQ_RC_EINVAL
[ 74.759770] i40e 0000:02:00.0: DCB init failed -5, disabled
[ 74.966550] i40e 0000:02:00.1: Query for DCB configuration failed, err -EIO aq_err I40E_AQ_RC_EINVAL
[ 74.975683] i40e 0000:02:00.1: DCB init failed -5, disabled
Fixes: cf488e13221f ("i40e: Add other helpers to check version of running firmware and AQ API")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fix an obviously incorrect assignment, created with a typo or cut-n-paste
error.
Fixes: 5995ef88e3a8 ("ice: realloc VSI stats arrays")
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The function ice_bridge_setlink() may encounter a NULL pointer dereference
if nlmsg_find_attr() returns NULL and br_spec is dereferenced subsequently
in nla_for_each_nested(). To address this issue, add a check to ensure that
br_spec is not NULL before proceeding with the nested attribute iteration.
Fixes: b1edc14a3fbf ("ice: Implement ice_bridge_getlink and ice_bridge_setlink")
Signed-off-by: Rand Deeb <rand.sec96@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The E800 series hardware uses the same iAVF driver as older devices,
including the virtchnl negotiation scheme.
This negotiation scheme includes a mechanism to determine what type of RSS
should be supported, including RSS over PF virtchnl messages, RSS over
firmware AdminQ messages, and RSS via direct register access.
The PF driver will always prefer VIRTCHNL_VF_OFFLOAD_RSS_PF if its
supported by the VF driver. However, if an older VF driver is loaded, it
may request only VIRTCHNL_VF_OFFLOAD_RSS_REG or VIRTCHNL_VF_OFFLOAD_RSS_AQ.
The ice driver happily agrees to support these methods. Unfortunately, the
underlying hardware does not support these mechanisms. The E800 series VFs
don't have the appropriate registers for RSS_REG. The mailbox queue used by
VFs for VF to PF communication blocks messages which do not have the
VF-to-PF opcode.
Stop lying to the VF that it could support RSS over AdminQ or registers, as
these interfaces do not work when the hardware is operating on an E800
series device.
In practice this is unlikely to be hit by any normal user. The iAVF driver
has supported RSS over PF virtchnl commands since 2016, and always defaults
to using RSS_PF if possible.
In principle, nothing actually stops the existing VF from attempting to
access the registers or send an AQ command. However a properly coded VF
will check the capability flags and will report a more useful error if it
detects a case where the driver does not support the RSS offloads that it
does.
Fixes: 1071a8358a28 ("ice: Implement virtchnl commands for AVF support")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Alan Brady <alan.brady@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fix softirq's not being handled during napi_schedule() call when
receiving marker packets for queue disable by disabling local bottom
half.
The issue can be seen on ifdown:
NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
Using ftrace to catch the failing scenario:
ifconfig [003] d.... 22739.830624: softirq_raise: vec=3 [action=NET_RX]
<idle>-0 [003] ..s.. 22739.831357: softirq_entry: vec=3 [action=NET_RX]
No interrupt and CPU is idle.
After the patch when disabling local BH before calling napi_schedule:
ifconfig [003] d.... 22993.928336: softirq_raise: vec=3 [action=NET_RX]
ifconfig [003] ..s1. 22993.928337: softirq_entry: vec=3 [action=NET_RX]
Fixes: c2d548cad150 ("idpf: add TX splitq napi poll support")
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Alan Brady <alan.brady@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In packet offload, packets are not encrypted in XFRM stack, so
the next network layer which the packets will be forwarded to
should depend on where the packet came from (either xfrm4_output
or xfrm6_output) rather than the matched SA's family type.
Test: verified IPv6-in-IPv4 packets on Android device with
IPsec packet offload enabled
Signed-off-by: Mike Yu <yumike@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In current code, xfrm_bundle_create() always uses the matched
SA's family type to look up a xfrm child route for the skb.
The route returned by xfrm_dst_lookup() will eventually be
used in xfrm_output_resume() (skb_dst(skb)->ops->local_out()).
If packet offload is used, the above behavior can lead to
calling ip_local_out() for an IPv6 packet or calling
ip6_local_out() for an IPv4 packet, which is likely to fail.
This change fixes the behavior by checking if the matched SA
has packet offload enabled. If not, keep the same behavior;
if yes, use the matched SP's family type for the lookup.
Test: verified IPv6-in-IPv4 packets on Android device with
IPsec packet offload enabled
Signed-off-by: Mike Yu <yumike@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmXizwMACgkQSD+KveBX
+j5uTAf/aI+qZtPXaCykJ86E0IfyPbFHNIK0OSEGuYGXXMABa6s/nBO87qkCVjZc
Lpnr7gj3plGwrHQTY30Ii3h6UpLpSY+LMhGKfGjtQAHiY4PIMIrTIcqp2+H4Zzxm
eK8DT/YUNPs/NT4GAKuwLkwxh9W1dj4fPac2kSth1UqKJnn9Y+GyCfwu4oVL+jhX
wT3P0F04ettHJN71xznmTOMWOBlWExchbdOi07tSvFmUMyDzRAmhFinE/1SrDxnl
l3nB5Qrhe5J1wLJH18gSju5k87sgsjevRqWDwZY+TFf3PEs/HUUqFMP1nmQf4LVF
pW3STDnPI/UyB4GzLn+Z5oDHaARRHw==
=6rTh
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2024-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2024-03-01
This series provides bug fixes to mlx5 driver.
Please pull and let me know if there is any problem.
* tag 'mlx5-fixes-2024-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Switch to using _bh variant of of spinlock API in port timestamping NAPI poll context
net/mlx5e: Use a memory barrier to enforce PTP WQ xmit submission tracking occurs after populating the metadata_map
net/mlx5e: Fix MACsec state loss upon state update in offload path
net/mlx5e: Change the warning when ignore_flow_level is not supported
net/mlx5: Check capability for fw_reset
net/mlx5: Fix fw reporter diagnose output
net/mlx5: E-switch, Change flow rule destination checking
Revert "net/mlx5e: Check the number of elements before walk TC rhashtable"
Revert "net/mlx5: Block entering switchdev mode with ns inconsistency"
====================
Link: https://lore.kernel.org/r/20240302070318.62997-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-03-01 (ixgbe, i40e, ice)
This series contains updates to ixgbe, i40e, and ice drivers.
Maciej corrects disable flow for ixgbe, i40e, and ice drivers which could
cause non-functional interface with AF_XDP.
Michal restores host configuration when changing MSI-X count for VFs on
ice driver.
* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: reconfig host after changing MSI-X on VF
ice: reorder disabling IRQ and NAPI in ice_qp_dis
i40e: disable NAPI right after disabling irqs when handling xsk_pool
ixgbe: {dis, en}able irqs in ixgbe_txrx_ring_{dis, en}able
====================
Link: https://lore.kernel.org/r/20240301192549.2993798-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Based on the static analyzis of the code it looks like when an entry
from the MAC table was removed, the entry was still used after being
freed. More precise the vid of the mac_entry was used after calling
devm_kfree on the mac_entry.
The fix consists in first using the vid of the mac_entry to delete the
entry from the HW and after that to free it.
Fixes: b37a1bae742f ("net: sparx5: add mactable support")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240301080608.3053468-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
selftests: mptcp: fixes for diag.sh
Here are two patches fixing issues in MPTCP diag.sh kselftest:
- Patch 1 makes sure the exit code is '1' in case of error, and not the
test ID, not to return an exit code that would be wrongly interpreted
by the ksefltests framework, e.g. '4' means 'skip'.
- Patch 2 avoids waiting for unnecessary conditions, which can cause
timeouts in some very slow environments.
====================
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
When creating a lot of listener sockets, it is enough to wait only for
the last one, like we are doing before in diag.sh for other subtests.
If we do a check for each listener sockets, each time listing all
available sockets, it can take a very long time in very slow
environments, at the point we can reach some timeout.
When using the debug kconfig, the waiting time switches from more than
8 sec to 0.1 sec on my side. In slow/busy environments, and with a poll
timeout set to 30 ms, the waiting time could go up to ~100 sec because
the listener socket would timeout and stop, while the script would still
be checking one by one if all sockets are ready. The result is that
after having waited for everything to be ready, all sockets have been
stopped due to a timeout, and it is too late for the script to check how
many there were.
While at it, also removed ss options we don't need: we only need the
filtering options, to count how many listener sockets have been created.
We don't need to ask ss to display internal TCP information, and the
memory if the output is dropped by the 'wc -l' command anyway.
Fixes: b4b51d36bbaa ("selftests: mptcp: explicitly trigger the listener diag code-path")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/r/20240301063754.2ecefecf@kernel.org
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The test counter 'test_cnt' should not be returned in diag.sh, e.g. what
if only the 4th test fail? Will do 'exit 4' which is 'exit ${KSFT_SKIP}',
the whole test will be marked as skipped instead of 'failed'!
So we should do ret=${KSFT_FAIL} instead.
Fixes: df62f2ec3df6 ("selftests/mptcp: add diag interface tests")
Cc: stable@vger.kernel.org
Fixes: 42fb6cddec3b ("selftests: mptcp: more stable diag tests")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If message fills up we need to stop writing. 'break' will
only get us out of the iteration over pools of a single
netdev, we need to also stop walking netdevs.
This results in either infinite dump, or missing pools,
depending on whether message full happens on the last
netdev (infinite dump) or non-last (missing pools).
Fixes: 950ab53b77ab ("net: page_pool: implement GET in the netlink API")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm updating __assign_str() and will be removing the second parameter. To
make sure that it does not break anything, I make sure that it matches the
__string() field, as that is where the string is actually going to be
saved in. To make sure there's nothing that breaks, I added a WARN_ON() to
make sure that what was used in __string() is the same that is used in
__assign_str().
In doing this change, an error was triggered as __assign_str() now expects
the string passed in to be a char * value. I instead had the following
warning:
include/trace/events/qdisc.h: In function ‘trace_event_raw_event_qdisc_reset’:
include/trace/events/qdisc.h:91:35: error: passing argument 1 of 'strcmp' from incompatible pointer type [-Werror=incompatible-pointer-types]
91 | __assign_str(dev, qdisc_dev(q));
That's because the qdisc_enqueue() and qdisc_reset() pass in qdisc_dev(q)
to __assign_str() and to __string(). But that function returns a pointer
to struct net_device and not a string.
It appears that these events are just saving the pointer as a string and
then reading it as a string as well.
Use qdisc_dev(q)->name to save the device instead.
Fixes: a34dac0b90552 ("net_sched: add tracepoints for qdisc_reset() and qdisc_destroy()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NAPI poll context is a softirq context. Do not use normal spinlock API
in this context to prevent concurrency issues.
Fixes: 3178308ad4ca ("net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
CC: Vadim Fedorenko <vadfed@meta.com>
Just simply reordering the functions mlx5e_ptp_metadata_map_put and
mlx5e_ptpsq_track_metadata in the mlx5e_txwqe_complete context is not good
enough since both the compiler and CPU are free to reorder these two
functions. If reordering does occur, the issue that was supposedly fixed by
7e3f3ba97e6c ("net/mlx5e: Track xmit submission to PTP WQ after populating
metadata map") will be seen. This will lead to NULL pointer dereferences in
mlx5e_ptpsq_mark_ts_cqes_undelivered in the NAPI polling context due to the
tracking list being populated before the metadata map.
Fixes: 7e3f3ba97e6c ("net/mlx5e: Track xmit submission to PTP WQ after populating metadata map")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
CC: Vadim Fedorenko <vadfed@meta.com>
The packet number attribute of the SA is incremented by the device rather
than the software stack when enabling hardware offload. Because the packet
number attribute is managed by the hardware, the software has no insight
into the value of the packet number attribute actually written by the
device.
Previously when MACsec offload was enabled, the hardware object for
handling the offload was destroyed when the SA was disabled. Re-enabling
the SA would lead to a new hardware object being instantiated. This new
hardware object would not have any recollection of the correct packet
number for the SA. Instead, destroy the flow steering rule when
deactivating the SA and recreate it upon reactivation, preserving the
original hardware object.
Fixes: 8ff0ac5be144 ("net/mlx5: Add MACsec offload Tx command support")
Signed-off-by: Emeel Hakim <ehakim@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Downgrade the print from mlx5_core_warn() to mlx5_core_dbg(), as it
is just a statement of fact that firmware doesn't support ignore flow
level.
And change the wording to "firmware flow level support is missing", to
make it more accurate.
Fixes: ae2ee3be99a8 ("net/mlx5: CT: Remove warning of ignore_flow_level support for VFs")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Suggested-by: Elliott, Robert (Servers) <elliott@hpe.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Functions which can't access MFRL (Management Firmware Reset Level)
register, have no use of fw_reset structures or events. Remove fw_reset
structures allocation and registration for fw reset events notifications
for these functions.
Having the devlink param enable_remote_dev_reset on functions that don't
have this capability is misleading as these functions are not allowed to
influence the reset flow. Hence, this patch removes this parameter for
such functions.
In addition, return not supported on devlink reload action fw_activate
for these functions.
Fixes: 38b9f903f22b ("net/mlx5: Handle sync reset request event")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Restore fw reporter diagnose to print the syndrome even if it is zero.
Following the cited commit, in this case (syndrome == 0) command returns no
output at all.
This fix restores command output in case syndrome is cleared:
$ devlink health diagnose pci/0000:82:00.0 reporter fw
Syndrome: 0
Fixes: d17f98bf7cc9 ("net/mlx5: devlink health: use retained error fmsg API")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The checking in the cited commit is not accurate. In the common case,
VF destination is internal, and uplink destination is external.
However, uplink destination with packet reformat is considered as
internal because firmware uses LB+hairpin to support it. Update the
checking so header rewrite rules with both internal and external
destinations are not allowed.
Fixes: e0e22d59b47a ("net/mlx5: E-switch, Add checking for flow rule destinations")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This reverts commit 4e25b661f484df54b6751b65f9ea2434a3b67539.
This Commit was mistakenly applied by pulling the wrong tag, remove it.
Fixes: 4e25b661f484 ("net/mlx5e: Check the number of elements before walk TC rhashtable")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
This reverts commit 662404b24a4c4d839839ed25e3097571f5938b9b.
The revert is required due to the suspicion it is not good for anything
and cause crash.
Fixes: 662404b24a4c ("net/mlx5e: Block entering switchdev mode with ns inconsistency")
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>