Add support for the Renesas RZ/N1 GMAC. This support can make use of a
custom RZ/N1 PCS which is fetched by parsing the pcs-handle device tree
property.
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Co-developed-by: Romain Gantois <romain.gantois@bootlin.com>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240513-rzn1-gmac1-v7-6-6acf58b5440d@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the newly introduced pcs_init() and pcs_exit() operations to
create and destroy the PCS instance at a more appropriate moment during
the driver lifecycle, thereby avoiding publishing a network device to
userspace that has not yet finished its PCS initialisation.
There are other similar issues with this driver which remain
unaddressed, but these are out of scope for this patch.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
[rgantois: removed second parameters of new callbacks]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240513-rzn1-gmac1-v7-5-6acf58b5440d@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce a mechanism whereby platforms can create their PCS instances
prior to the network device being published to userspace, but after
some of the core stmmac initialisation has been completed. This means
that the data structures that platforms need will be available.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Co-developed-by: Romain Gantois <romain.gantois@bootlin.com>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240513-rzn1-gmac1-v7-4-6acf58b5440d@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A pcs_init() callback will be introduced to stmmac in a future patch. This
new function will be called during the hardware initialization phase.
Instead of separately initializing XPCS and PCS components, let's group all
PCS-related hardware initialization logic in the current
stmmac_xpcs_setup() function.
Rename stmmac_xpcs_setup() to stmmac_pcs_setup() and move the conditional
call to stmmac_xpcs_setup() inside the function itself.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Co-developed-by: Romain Gantois <romain.gantois@bootlin.com>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240513-rzn1-gmac1-v7-3-6acf58b5440d@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the XPCS handler destruction is performed in the
stmmac_mdio_unregister() method. It doesn't look good because the handler
isn't originally created in the corresponding protagonist
stmmac_mdio_unregister(), but in the stmmac_xpcs_setup() function. In
order to have more coherent MDIO and XPCS setup/cleanup procedures,
let's move the DW XPCS destruction to the dedicated stmmac_pcs_clean()
method.
This method will also be used to cleanup PCS hardware using the
pcs_exit() callback that will be introduced to stmmac in a subsequent
patch.
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Co-developed-by: Romain Gantois <romain.gantois@bootlin.com>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240513-rzn1-gmac1-v7-2-6acf58b5440d@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RZ/N1 series of MPUs feature up to two Gigabit Ethernet controllers.
These controllers are based on Synopsys IPs. They can be connected to
RZ/N1 RGMII/RMII converters.
Add a binding that describes these GMAC devices.
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
[rgantois: commit log]
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Link: https://lore.kernel.org/r/20240513-rzn1-gmac1-v7-1-6acf58b5440d@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This driver currently doesn't support any control flags.
Use flow_rule_match_has_control_flags() to check for control flags,
such as can be set through `tc flower ... ip_flags frag`.
In case any control flags are masked, flow_rule_match_has_control_flags()
sets a NL extended error message, and we return -EOPNOTSUPP.
Only compile-tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240511073705.230507-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xuan Zhuo says:
====================
virtio_net: rx enable premapped mode by default
Actually, for the virtio drivers, we can enable premapped mode whatever
the value of use_dma_api. Because we provide the virtio dma apis.
So the driver can enable premapped mode unconditionally.
This patch set makes the big mode of virtio-net to support premapped mode.
And enable premapped mode for rx by default.
Based on the following points, we do not use page pool to manage these
pages:
1. virtio-net uses the DMA APIs wrapped by virtio core. Therefore,
we can only prevent the page pool from performing DMA operations, and
let the driver perform DMA operations on the allocated pages.
2. But when the page pool releases the page, we have no chance to
execute dma unmap.
3. A solution to #2 is to execute dma unmap every time before putting
the page back to the page pool. (This is actually a waste, we don't
execute unmap so frequently.)
4. But there is another problem, we still need to use page.dma_addr to
save the dma address. Using page.dma_addr while using page pool is
unsafe behavior.
5. And we need space the chain the pages submitted once to virtio core.
More:
https://lore.kernel.org/all/CACGkMEu=Aok9z2imB_c5qVuujSh=vjj1kx12fy9N7hqyi+M5Ow@mail.gmail.com/
Why we do not use the page space to store the dma?
http://lore.kernel.org/all/CACGkMEuyeJ9mMgYnnB42=hw6umNuo=agn7VBqBqYPd7GN=+39Q@mail.gmail.com
====================
Link: https://lore.kernel.org/r/20240511031404.30903-1-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We call the build_skb() actually without copying data.
The comment is misleading. So remove it.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20240511031404.30903-5-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now, the premapped mode can be enabled unconditionally.
So we can remove the failover code for merge and small mode.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://lore.kernel.org/r/20240511031404.30903-4-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The virtio-net big mode did not enable premapped mode,
so we did not need to check the unmap. And the subsequent
commit will remove the failover code for failing enable
premapped for merge and small mode. So we need to remove
the checking do_dma code in the big mode path.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20240511031404.30903-3-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now, we have virtio DMA APIs, the driver can be the premapped
mode whatever the virtio core uses dma api or not.
So remove the limit of checking use_dma_api from
virtqueue_set_dma_premapped().
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20240511031404.30903-2-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGcZAAKCRDbK58LschI
g6o6APwLsqhrM2w71VUN5ciCxu4H5VDtZp6wkdqtVbxxU4qNxQEApKgYgKt8ZLF3
Kily5c7m+S4ZXhMX21rb8JhSAz0dfQk=
=5Dk7
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-05-13
We've added 119 non-merge commits during the last 14 day(s) which contain
a total of 134 files changed, 9462 insertions(+), 4742 deletions(-).
The main changes are:
1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi.
2) Add BPF range computation improvements to the verifier in particular
around XOR and OR operators, refactoring of checks for range computation
and relaxing MUL range computation so that src_reg can also be an unknown
scalar, from Cupertino Miranda.
3) Add support to attach kprobe BPF programs through kprobe_multi link in
a session mode, meaning, a BPF program is attached to both function entry
and return, the entry program can decide if the return program gets
executed and the entry program can share u64 cookie value with return
program. Session mode is a common use-case for tetragon and bpftrace,
from Jiri Olsa.
4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf
as well as BPF selftest's struct_ops handling, from Andrii Nakryiko.
5) Improvements to BPF selftests in context of BPF gcc backend,
from Jose E. Marchesi & David Faust.
6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test-
-style in order to retire the old test, run it in BPF CI and additionally
expand test coverage, from Jordan Rife.
7) Big batch for BPF selftest refactoring in order to remove duplicate code
around common network helpers, from Geliang Tang.
8) Another batch of improvements to BPF selftests to retire obsolete
bpf_tcp_helpers.h as everything is available vmlinux.h,
from Martin KaFai Lau.
9) Fix BPF map tear-down to not walk the map twice on free when both timer
and wq is used, from Benjamin Tissoires.
10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL,
from Alexei Starovoitov.
11) Change BTF build scripts to using --btf_features for pahole v1.26+,
from Alan Maguire.
12) Small improvements to BPF reusing struct_size() and krealloc_array(),
from Andy Shevchenko.
13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions,
from Ilya Leoshkevich.
14) Extend TCP ->cong_control() callback in order to feed in ack and
flag parameters and allow write-access to tp->snd_cwnd_stamp
from BPF program, from Miao Xu.
15) Add support for internal-only per-CPU instructions to inline
bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs,
from Puranjay Mohan.
16) Follow-up to remove the redundant ethtool.h from tooling infrastructure,
from Tushar Vyavahare.
17) Extend libbpf to support "module:<function>" syntax for tracing
programs, from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
bpf: make list_for_each_entry portable
bpf: ignore expected GCC warning in test_global_func10.c
bpf: disable strict aliasing in test_global_func9.c
selftests/bpf: Free strdup memory in xdp_hw_metadata
selftests/bpf: Fix a few tests for GCC related warnings.
bpf: avoid gcc overflow warning in test_xdp_vlan.c
tools: remove redundant ethtool.h from tooling infra
selftests/bpf: Expand ATTACH_REJECT tests
selftests/bpf: Expand getsockname and getpeername tests
sefltests/bpf: Expand sockaddr hook deny tests
selftests/bpf: Expand sockaddr program return value tests
selftests/bpf: Retire test_sock_addr.(c|sh)
selftests/bpf: Remove redundant sendmsg test cases
selftests/bpf: Migrate ATTACH_REJECT test cases
selftests/bpf: Migrate expected_attach_type tests
selftests/bpf: Migrate wildcard destination rewrite test
selftests/bpf: Migrate sendmsg6 v4 mapped address tests
selftests/bpf: Migrate sendmsg deny test cases
selftests/bpf: Migrate WILDCARD_IP test
selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases
...
====================
Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nothing useful is done with the LPA variable in lynx_pcs_get_state_2500basex(),
we can just remove the read.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240513115345.2452799-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan says:
====================
mlx5 misc patches
This series includes patches for the mlx5 driver.
Patch 1 by Shay enables LAG with HCAs of 8 ports.
Patch 2 by Carolina optimizes the safe switch channels operation for the
TX-only changes.
Patch 3 by Parav cleans up some unused code.
====================
Link: https://lore.kernel.org/r/20240512124306.740898-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MSIX irq allocation and free APIs are no longer
in use. Hence, remove the dead code.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://lore.kernel.org/r/20240512124306.740898-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is not appropriate for the mlx5e_num_channels_changed
function to be called solely for updating the TX queues,
even if the channels number has not been changed.
Move the code responsible for updating the TC and TX queues
from mlx5e_num_channels_changed and produce a new function
called mlx5e_update_tc_and_tx_queues. This new function should
only be called when the channels number remains unchanged.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512124306.740898-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds to mlx5 drivers support for 8 ports HCAs.
Starting with ConnectX-8 HCAs with 8 ports are possible.
As most driver parts aren't affected by such configuration most driver
code is unchanged.
Specially the only affected areas are:
- Lag
- Multiport E-Switch
- Single FDB E-Switch
All of the above are already factored in generic way, and LAG and VF LAG
are tested, so all that left is to change a #define and remove checks
which are no longer needed.
However, Multiport E-Switch is not tested yet, so it is left untouched.
This patch will allow to create hardware LAG/VF LAG when all 8 ports are
added to the same bond device.
for example, In order to activate the hardware lag a user can execute
the following:
ip link add bond0 type bond
ip link set bond0 type bond miimon 100 mode 2
ip link set eth2 master bond0
ip link set eth3 master bond0
ip link set eth4 master bond0
ip link set eth5 master bond0
ip link set eth6 master bond0
ip link set eth7 master bond0
ip link set eth8 master bond0
ip link set eth9 master bond0
Where eth2, eth3, eth4, eth5, eth6, eth7, eth8 and eth9 are the PFs of
the same HCA.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512124306.740898-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After this change the single SAN device (ns3eth1) is now replaced with
two SAN devices - respectively ns4eth1 and ns5eth1.
It is possible to extend this script to have more SAN devices connected
by adding them to ns3br1 bridge.
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Link: https://lore.kernel.org/r/20240510143710.3916631-1-lukma@denx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleksij Rempel says:
====================
net: dsa: microchip: DCB fixes
This patch series address recommendation to rename IPV to IPM to avoid
confusion with IPV name used in 802.1Qci PSFP. And restores default "PCP
only" configuration as source of priorities to avoid possible
regressions.
====================
Link: https://lore.kernel.org/r/20240510053828.2412516-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before DCB support, the KSZ driver had only PCP as source of packet
priority values. To avoid regressions, make PCP only as default value.
User will need enable DSCP support manually.
This patch do not affect other KSZ8 related quirks. User will still be
warned by setting not support configurations for the port 2.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240510053828.2412516-4-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All other functions are commented. Add missing comments to following
functions:
ksz_set_global_dscp_entry()
ksz_port_add_dscp_prio()
ksz_port_del_dscp_prio()
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240510053828.2412516-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPV is added and used term in 802.1Qci PSFP and merged into 802.1Q (from
802.1Q-2018) for another functions.
Even it does similar operation holding temporal priority value
internally (as it is named), because KSZ datasheet doesn't use the term
of IPV (Internal Priority Value) and avoiding any confusion later when
PSFP is in the Linux world, it is better to rename IPV to IPM (Internal
Priority Mapping).
In addition, LAN937x documentation already use IPV for 802.1Qci PSFP
related functionality.
Suggested-by: Woojung Huh <Woojung.Huh@microchip.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Woojung Huh <woojung.huh@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240510053828.2412516-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
628bc3e5a1be ("l2tp: Support several sockets with same IP/port quadruple")
added support for several L2TPv2 tunnels using the same IP/port quadruple,
but if an L2TPv3 socket exists it could eat all the trafic. We thus have to
first use the version from the packet to get the proper tunnel, and only
then check that the version matches.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Reviewed-by: James Chapman <jchapman@katalix.com>
Link: https://lore.kernel.org/r/20240509205812.4063198-1-samuel.thibault@ens-lyon.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For type String and Binary we are currently usinig the exact-len
limit value as is without attempting any name resolution.
However, the spec may specify the name of a constant rather than an
actual value, which would result in using the constant name as is
and thus break the policy.
Ensure the limit value is passed to get_limit(), which will always
attempt resolving the name before printing the policy rule.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Link: https://lore.kernel.org/r/20240510232202.24051-1-a@unstable.cc
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Jurgens says:
====================
Add TX stop/wake counters
Several drivers provide TX stop and wake counters via ethtool stats. Add
those to the netdev queue stats, and use them in virtio_net.
====================
Link: https://lore.kernel.org/r/20240510201927.1821109-1-danielj@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TX queue stop and wake are counted by some drivers.
Support reporting these via netdev-genl queue stats.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240510201927.1821109-2-danielj@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A way for an application to know if an MPTCP connection fell back to TCP
is to use getsockopt(MPTCP_INFO) and look for errors. The issue with
this technique is that the same errors -- EOPNOTSUPP (IPv4) and
ENOPROTOOPT (IPv6) -- are returned if there was a fallback, *or* if the
kernel doesn't support this socket option. The userspace then has to
look at the kernel version to understand what the errors mean.
It is not clean, and it doesn't take into account older kernels where
the socket option has been backported. A cleaner way would be to expose
this info to the TCP socket level. In case of MPTCP socket where no
fallback happened, the socket options for the TCP level will be handled
in MPTCP code, in mptcp_getsockopt_sol_tcp(). If not, that will be in
TCP code, in do_tcp_getsockopt(). So MPTCP simply has to set the value
1, while TCP has to set 0.
If the socket option is not supported, one of these two errors will be
reported:
- EOPNOTSUPP (95 - Operation not supported) for MPTCP sockets
- ENOPROTOOPT (92 - Protocol not available) for TCP sockets, e.g. on the
socket received after an 'accept()', when the client didn't request to
use MPTCP: this socket will be a TCP one, even if the listen socket
was an MPTCP one.
With this new option, the kernel can return a clear answer to both "Is
this kernel new enough to tell me the fallback status?" and "If it is
new enough, is it currently a TCP or MPTCP socket?" questions, while not
breaking the previous method.
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240509-upstream-net-next-20240509-mptcp-tcp_is_mptcp-v1-1-f846df999202@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Richard Gobert says:
====================
net: gro: remove network_header use, move p->{flush/flush_id} calculations to L4
The cb fields network_offset and inner_network_offset are used instead of
skb->network_header throughout GRO.
These fields are then leveraged in the next commit to remove flush_id state
from napi_gro_cb, and stateful code in {ipv6,inet}_gro_receive which may be
unnecessarily complicated due to encapsulation support in GRO. These fields
are checked in L4 instead.
3rd patch adds tests for different flush_id flows in GRO.
====================
Link: https://lore.kernel.org/r/20240509190819.2985-1-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Added flush id selftests to test different cases where DF flag is set or
unset and id value changes in the following packets. All cases where the
packets should coalesce or should not coalesce are tested.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-4-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used in
all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb,
since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both
outer and inner network headers - allowing these checks to be done only
once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
more declarative and contained in inet_gro_flush, thus removing the need
for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP
flows.
To make sure results are not within noise range - I've made netfilter drop
all TCP packets, and measured CPU performance in GRO (in this case GRO is
responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO:
(gro_receive_network_flush is compiled inline to tcp_gro_receive)
net-next:
6.94% [kernel] [k] inet_gro_receive
3.02% [kernel] [k] tcp_gro_receive
patch applied:
4.27% [kernel] [k] tcp_gro_receive
4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation, in this case inet_gro_receive is top
offender in net-next)
net-next:
10.09% [kernel] [k] inet_gro_receive
2.08% [kernel] [k] tcp_gro_receive
patch applied:
6.97% [kernel] [k] inet_gro_receive
3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch converts references of skb->network_header to napi_gro_cb's
network_offset and inner_network_offset.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-2-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Arinzon says:
====================
ENA driver changes May 2024
This patchset contains several misc and minor
changes to the ENA driver.
====================
Link: https://lore.kernel.org/r/20240512134637.25299-1-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For the purpose of obtaining better CPU utilization,
minimum rx moderation interval is set to 20 usec.
Signed-off-by: Osama Abboud <osamaabb@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-6-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
strscpy copies as much of the string as possible,
meaning that the destination string will be truncated
in case of no space. As this is a non-critical error in
our case, adding a debug level print for indication.
This patch also removes a -1 which was added to ensure
enough space for NUL, but strscpy destination string is
guaranteed to be NUL-terminted, therefore, the -1 is
not needed.
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-5-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Validate that `first` flag is set only for the first
descriptor in multi-buffer packets.
In case of an invalid descriptor, a reset will occur.
A new reset reason for RX data corruption has been added.
Signed-off-by: Shahar Itzko <itzko@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-4-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch makes two changes in order to fill holes and
reduce ther overall size of the structures ena_com_dev
and ena_com_rx_ctx.
Signed-off-by: Shahar Itzko <itzko@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-3-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a counter to the ena_adapter struct in
order to keep track of reset failures.
The counter is incremented every time either ena_restore_device()
or ena_destroy_device() fail.
Signed-off-by: Osama Abboud <osamaabb@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-2-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that this test runs in netdev CI it looks like 10s isn't enough
for debug kernels:
selftests: net/netfilter: nft_flowtable.sh
2024/05/10 20:33:08 socat[12204] E write(7, 0x563feb16a000, 8192): Broken pipe
FAIL: file mismatch for ns1 -> ns2
-rw------- 1 root root 37345280 May 10 20:32 /tmp/tmp.Am0yEHhNqI
...
Looks like socat gets zapped too quickly, so increase timeout to 1m.
Could also reduce tx file size for KSFT_MACHINE_SLOW, but its preferrable
to have same test for both debug and nondebug.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240511064814.561525-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joachim kindly merged the IPv6 support in
https://github.com/troglobit/mtools/pull/2, so we can just use his
version now. A few more fixes subsequently came in for IPv6, so even
better.
Check that the deployed mtools version is 3.0 or above. Note that the
version check breaks compatibility with my fork where I didn't bump the
version, but I assume that won't be a problem.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240510112856.1262901-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Setting LED_OFF via brightness_set should deactivate hw control, so make
sure netdev trigger rules also get cleared in that case.
This fixes unwanted restoration of the default netdev trigger rules and
matches the behaviour when using the 'netdev' trigger without any
hardware offloading.
Fixes: 71e79430117d ("net: phy: air_en8811h: Add the Airoha EN8811H PHY driver")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://lore.kernel.org/r/5ed8ea615890a91fa4df59a7ae8311bbdf63cdcf.1715248281.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmZA578ACgkQ1V2XiooU
IOS18g//Zyuv+23GcUM7+FEXrlMN658xJyWiYvKjOaZZx5ZiV0QdZc4cbfPFD44p
qZBMmVC/WVoC89SLdwH1W47KoJU0xsK3/OdqGHHeNJ69111wIQMpOLfAetS2K0mb
F+Ue2vyWg1GQDICGsCdenHX7ihtVvnJJkomxc+3ObxtLCNsb2Dsr6JM5hMVP5Bil
4UZnPsrgfWy3A8O92burlPVE1sTWDFfFUGIf8geJc4QadwkgufkzxMhXNO7xHlpG
EZ99s8FPyD3R6tRPjf4gwdjr7JjinrdrYjZDuS4d3Uv8pKlUqcx8PgXG51/unr/y
qlynLXtEc1QU6SO2jENosHAG2/LQG2zsYEiiLFCP+a1JOtOxevZQKx8MyAFW8xDX
+RQhcBpTBocIyJ/tCDoM9lp69iYTR196Ct48v6pSGMNhZcddT4K4BkUL47GEs8T3
IA5x8h5gV2Q9ECMgqSaycdUsfLNgE/6fWx0ROs/wo3tMsgWrXCSJi8RFtN1sNbIO
rfuNnQiETIFBkQxBi7um8jadxdfIHm65cjgZBCyVbNNml3JwjYvXLxCXt2G7LxC4
Sg4nZvIbqWIifoMc1aQKypvFZjzzsWtFmYCuEUVLrnpj2SFTyh5CNzNo3MlHf7LG
sRb/XubdY6e0spLzd5VDjwH5qOT3poWAccatRr5BVUarxXCD5Vs=
=RD9p
-----END PGP SIGNATURE-----
Merge tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
Patch #1 skips transaction if object type provides no .update interface.
Patch #2 skips NETDEV_CHANGENAME which is unused.
Patch #3 enables conntrack to handle Multicast Router Advertisements and
Multicast Router Solicitations from the Multicast Router Discovery
protocol (RFC4286) as untracked opposed to invalid packets.
From Linus Luessing.
Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of
dropping them, from Jason Xing.
Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0,
also from Jason.
Patch #6 removes reference in netfilter's sysctl documentation on pickup
entries which were already removed by Florian Westphal.
Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which
allows to evict entries from the conntrack table,
also from Florian.
Patches #8 to #16 updates nf_tables pipapo set backend to allocate
the datastructure copy on-demand from preparation phase,
to better deal with OOM situations where .commit step is too late
to fail. Series from Florian Westphal.
Patch #17 adds a selftest with packetdrill to cover conntrack TCP state
transitions, also from Florian.
Patch #18 use GFP_KERNEL to clone elements from control plane to avoid
quick atomic reserves exhaustion with large sets, reporter refers
to million entries magnitude.
* tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: allow clone callbacks to sleep
selftests: netfilter: add packetdrill based conntrack tests
netfilter: nft_set_pipapo: remove dirty flag
netfilter: nft_set_pipapo: move cloning of match info to insert/removal path
netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone
netfilter: nft_set_pipapo: merge deactivate helper into caller
netfilter: nft_set_pipapo: prepare walk function for on-demand clone
netfilter: nft_set_pipapo: prepare destroy function for on-demand clone
netfilter: nft_set_pipapo: make pipapo_clone helper return NULL
netfilter: nft_set_pipapo: move prove_locking helper around
netfilter: conntrack: remove flowtable early-drop test
netfilter: conntrack: documentation: remove reference to non-existent sysctl
netfilter: use NF_DROP instead of -NF_DROP
netfilter: conntrack: dccp: try not to drop skb in conntrack
netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery
netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler
netfilter: nf_tables: skip transaction if update object is not implemented
====================
Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[Changes from V1:
- The __compat_break has been abandoned in favor of
a more readable can_loop macro that can be used anywhere, including
loop conditions.]
The macro list_for_each_entry is defined in bpf_arena_list.h as
follows:
#define list_for_each_entry(pos, head, member) \
for (void * ___tmp = (pos = list_entry_safe((head)->first, \
typeof(*(pos)), member), \
(void *)0); \
pos && ({ ___tmp = (void *)pos->member.next; 1; }); \
cond_break, \
pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member))
The macro cond_break, in turn, expands to a statement expression that
contains a `break' statement. Compound statement expressions, and the
subsequent ability of placing statements in the header of a `for'
loop, are GNU extensions.
Unfortunately, clang implements this GNU extension differently than
GCC:
- In GCC the `break' statement is bound to the containing "breakable"
context in which the defining `for' appears. If there is no such
context, GCC emits a warning: break statement without enclosing `for'
o `switch' statement.
- In clang the `break' statement is bound to the defining `for'. If
the defining `for' is itself inside some breakable construct, then
clang emits a -Wgcc-compat warning.
This patch adds a new macro can_loop to bpf_experimental, that
implements the same logic than cond_break but evaluates to a boolean
expression. The patch also changes all the current instances of usage
of cond_break withing the header of loop accordingly.
Tested in bpf-next master.
No regressions.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Link: https://lore.kernel.org/r/20240511212243.23477-1-jose.marchesi@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The BPF selftest global_func10 in progs/test_global_func10.c contains:
struct Small {
long x;
};
struct Big {
long x;
long y;
};
[...]
__noinline int foo(const struct Big *big)
{
if (!big)
return 0;
return bpf_get_prandom_u32() < big->y;
}
[...]
SEC("cgroup_skb/ingress")
__failure __msg("invalid indirect access to stack")
int global_func10(struct __sk_buff *skb)
{
const struct Small small = {.x = skb->len };
return foo((struct Big *)&small) ? 1 : 0;
}
GCC emits a "maybe uninitialized" warning for the code above, because
it knows `foo' accesses `big->y'.
Since the purpose of this selftest is to check that the verifier will
fail on this sort of invalid memory access, this patch just silences
the compiler warning.
Tested in bpf-next master.
No regressions.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240511212349.23549-1-jose.marchesi@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The BPF selftest test_global_func9.c performs type punning and breaks
srict-aliasing rules.
In particular, given:
int global_func9(struct __sk_buff *skb)
{
int result = 0;
[...]
{
const struct C c = {.x = skb->len, .y = skb->family };
result |= foo((const struct S *)&c);
}
}
When building with strict-aliasing enabled (the default) the
initialization of `c' gets optimized away in its entirely:
[... no initialization of `c' ...]
r1 = r10
r1 += -40
call foo
w0 |= w6
Since GCC knows that `foo' accesses s->x, we get a "maybe
uninitialized" warning.
On the other hand, when strict-aliasing is disabled GCC only optimizes
away the store to `.y':
r1 = *(u32 *) (r6+0)
*(u32 *) (r10+-40) = r1 ; This is .x = skb->len in `c'
r1 = r10
r1 += -40
call foo
w0 |= w6
In this case the warning is not emitted, because s-> is initialized.
This patch disables strict aliasing in this test when building with
GCC. clang seems to not optimize this particular code even when
strict aliasing is enabled.
Tested in bpf-next master.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240511212213.23418-1-jose.marchesi@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The strdup() function returns a pointer to a new string which is a
duplicate of the string "ifname". Memory for the new string is obtained
with malloc(), and need to be freed with free().
This patch adds this missing "free(saved_hwtstamp_ifname)" in cleanup()
to avoid a potential memory leak in xdp_hw_metadata.c.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/af9bcccb96655e82de5ce2b4510b88c9c8ed5ed0.1715417367.git.tanggeliang@kylinos.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch corrects a few warnings to allow selftests to compile for
GCC.
-- progs/cpumask_failure.c --
progs/bpf_misc.h:136:22: error: ‘cpumask’ is used uninitialized
[-Werror=uninitialized]
136 | #define __sink(expr) asm volatile("" : "+g"(expr))
| ^~~
progs/cpumask_failure.c:68:9: note: in expansion of macro ‘__sink’
68 | __sink(cpumask);
The macro __sink(cpumask) with the '+' contraint modifier forces the
the compiler to expect a read and write from cpumask. GCC detects
that cpumask is never initialized and reports an error.
This patch removes the spurious non required definitions of cpumask.
-- progs/dynptr_fail.c --
progs/dynptr_fail.c:1444:9: error: ‘ptr1’ may be used uninitialized
[-Werror=maybe-uninitialized]
1444 | bpf_dynptr_clone(&ptr1, &ptr2);
Many of the tests in the file are related to the detection of
uninitialized pointers by the verifier. GCC is able to detect possible
uninitialized values, and reports this as an error.
The patch initializes all of the previous uninitialized structs.
-- progs/test_tunnel_kern.c --
progs/test_tunnel_kern.c:590:9: error: array subscript 1 is outside
array bounds of ‘struct geneve_opt[1]’ [-Werror=array-bounds=]
590 | *(int *) &gopt.opt_data = bpf_htonl(0xdeadbeef);
| ^~~~~~~~~~~~~~~~~~~~~~~
progs/test_tunnel_kern.c:575:27: note: at offset 4 into object ‘gopt’ of
size 4
575 | struct geneve_opt gopt;
This tests accesses beyond the defined data for the struct geneve_opt
which contains as last field "u8 opt_data[0]" which clearly does not get
reserved space (in stack) in the function header. This pattern is
repeated in ip6geneve_set_tunnel and geneve_set_tunnel functions.
GCC is able to see this and emits a warning.
The patch introduces a local struct that allocates enough space to
safely allow the write to opt_data field.
-- progs/jeq_infer_not_null_fail.c --
progs/jeq_infer_not_null_fail.c:21:40: error: array subscript ‘struct
bpf_map[0]’ is partly outside array bounds of ‘struct <anonymous>[1]’
[-Werror=array-bounds=]
21 | struct bpf_map *inner_map = map->inner_map_meta;
| ^~
progs/jeq_infer_not_null_fail.c:14:3: note: object ‘m_hash’ of size 32
14 | } m_hash SEC(".maps");
This example defines m_hash in the context of the compilation unit and
casts it to struct bpf_map which is much smaller than the size of struct
bpf_map. It errors out in GCC when it attempts to access an element that
would be defined in struct bpf_map outsize of the defined limits for
m_hash.
This patch disables the warning through a GCC pragma.
This changes were tested in bpf-next master selftests without any
regressions.
Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Cc: jose.marchesi@oracle.com
Cc: david.faust@oracle.com
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/r/20240510183850.286661-2-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch fixes an integer overflow warning raised by GCC in
xdp_prognum1 of progs/test_xdp_vlan.c:
GCC-BPF [test_maps] test_xdp_vlan.bpf.o
progs/test_xdp_vlan.c: In function 'xdp_prognum1':
progs/test_xdp_vlan.c:163:25: error: integer overflow in expression
'(short int)(((__builtin_constant_p((int)vlan_hdr->h_vlan_TCI)) != 0
? (int)(short unsigned int)((short int)((int)vlan_hdr->h_vlan_TCI
<< 8 >> 8) << 8 | (short int)((int)vlan_hdr->h_vlan_TCI << 0 >> 8
<< 0)) & 61440 : (int)__builtin_bswap16(vlan_hdr->h_vlan_TCI)
& 61440) << 8 >> 8) << 8' of type 'short int' results in '0' [-Werror=overflow]
163 | bpf_htons((bpf_ntohs(vlan_hdr->h_vlan_TCI) & 0xf000)
| ^~~~~~~~~
The problem lies with the expansion of the bpf_htons macro and the
expression passed into it. The bpf_htons macro (and similarly the
bpf_ntohs macro) expand to a ternary operation using either
__builtin_bswap16 or ___bpf_swab16 to swap the bytes, depending on
whether the expression is constant.
For an expression, with 'value' as a u16, like:
bpf_htons (value & 0xf000)
The entire (value & 0xf000) is 'x' in the expansion of ___bpf_swab16
and we get as one part of the expanded swab16:
((__u16)(value & 0xf000) << 8 >> 8 << 8
This will always evaluate to 0, which is intentional since this
subexpression deals with the byte guaranteed to be 0 by the mask.
However, GCC warns because the precise reason this always evaluates to 0
is an overflow. Specifically, the plain 0xf000 in the expression is a
signed 32-bit integer, which causes 'value' to also be promoted to a
signed 32-bit integer, and the combination of the 8-bit left shift and
down-cast back to __u16 results in a signed overflow (really a 'warning:
overflow in conversion from int to __u16' which is propegated up through
the rest of the expression leading to the ultimate overflow warning
above), which is a valid warning despite being the intended result of
this code.
Clang does not warn on this case, likely because it performs constant
folding later in the compilation process relative to GCC. It seems that
by the time clang does constant folding for this expression, the side of
the ternary with this overflow has already been discarded.
Fortunately, this warning is easily silenced by simply making the 0xf000
mask explicitly unsigned. This has no impact on the result.
Signed-off-by: David Faust <david.faust@oracle.com>
Cc: jose.marchesi@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240508193512.152759-1-david.faust@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>