If 'force' is false, it will keep the memory pages and all data
structures for the context memory type if the memory is valid.
This patch always passes true for the 'force' parameter so there is
no change in behavior. Later patches will adjust the 'force' parameter
for the FW log context memory types so that the logs will not be reset
after FW reset.
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241115151438.550106-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new function bnxt_free_one_ctx_mem() to free one context
memory type. bnxt_free_ctx_mem() now calls the new function in
the loop to free each context memory type. There is no change in
behavior. Later patches will further make use of the new function.
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241115151438.550106-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new bit to struct bnxt_ctx_mem_type to indicate that host
memory has been successfully allocated for this context memory type.
In the next patches, we'll be adding some additional context memory
types for FW debugging/logging. If memory cannot be allocated for
any of these new types, we will not abort and the cleared mem_valid
bit will indicate to skip configuring the memory type.
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-of-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241115151438.550106-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The major change is the new firmware command to flush the FW debug
logs to the host backing store context memory buffers.
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241115151438.550106-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason A. Donenfeld says:
====================
wireguard updates and fixes for 6.13
This tiny series (+3/-2) fixes one bug and has three small improvements.
1) Fix running the netns.sh test suite on systems that haven't yet
inserted the nf_conntrack module.
2) Remove a stray useless function call in a selftest.
3) There's no need to zero out the netdev private data in recent
kernels.
4) Set the TSO max size to be GSO_MAX_SIZE, so that we aggregate larger
packets. Daniel reports seeing a 15% improvement in a simple load and
suggested the speedups would be even better in more complex loads.
====================
Link: https://patch.msgid.link/20241117212030.629159-1-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Advertise GSO_MAX_SIZE as TSO max size in order support BIG TCP for wireguard.
This helps to improve wireguard performance a bit when enabled as it allows
wireguard to aggregate larger skbs in wg_packet_consume_data_done() via
napi_gro_receive(), but also allows the stack to build larger skbs on xmit
where the driver then segments them before encryption inside wg_xmit().
We've seen a 15% improvement in TCP stream performance.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20241117212030.629159-5-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some distros may not load nf_conntrack by default, which will cause
subsequent nf_conntrack sets to fail. Load this module if it is not
already loaded.
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
[ Jason: add [[ -e ... ]] check so this works in the qemu harness. ]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20241117212030.629159-4-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit fixes a useless call issue detected by Coverity (CID
1508092). The call to horrible_allowedips_lookup_v4 is unnecessary as
its return value is never checked.
Signed-off-by: Dheeraj Reddy Jonnalagadda <dheeraj.linuxdev@gmail.com>
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20241117212030.629159-3-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The memory for netdev_priv is allocated using kvzalloc in
alloc_netdev_mqs before rtnl_link_ops->setup is called so there is no
need to zero it again in wg_setup.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20241117212030.629159-2-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fun_create_queue was added in 2022 by
commit e1ffcc66818f ("net/fungible: Add service module for Fungible
drivers")
but hasn't been used.
Remove it.
Also remove the static helper functions it was the only user of.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kees Cook says:
====================
UAPI: ethtool: Avoid flex-array in struct ethtool_link_settings
This reverts the tagged struct group in struct ethtool_link_settings and
instead just removes the flexible array member from Linux's view as it
is entirely unused.
====================
Link: https://patch.msgid.link/20241115204115.work.686-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct ethtool_link_settings tends to be used as a header for other
structures that have trailing bytes[1], but has a trailing flexible array
itself. Using this overlapped with other structures leads to ambiguous
object sizing in the compiler, so we want to avoid such situations (which
have caused real bugs in the past). Detecting this can be done with
-Wflex-array-member-not-at-end, which will need to be enabled globally.
Using a tagged struct_group() to create a new ethtool_link_settings_hdr
structure isn't possible as it seems we cannot use the tagged variant of
struct_group() due to syntax issues from C++'s perspective (even within
"extern C")[2]. Instead, we can just leave the offending member defined
in UAPI and remove it from the kernel's view of the structure, as Linux
doesn't actually use this member at all. There is also no change in
size since it was already a flexible array that didn't contribute to
size returned by any use of sizeof().
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/lkml/20241109100213.262a2fa0@kernel.org/ [2]
Link: https://lore.kernel.org/lkml/0bc2809fe2a6c11dd4c8a9a10d9bd65cccdb559b.1730238285.git.gustavoars@kernel.org/ [1]
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20241115204308.3821419-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 43d3487035e9a86fad952de4240a518614240d43. We cannot
use tagged struct groups in UAPI because C++ will throw syntax errors
even under "extern C".
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20241115204308.3821419-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 3bd9b9abdf1563a22041b7255baea6d449902f1a. We cannot
use the new tagged struct group because it throws C++ errors even under
"extern C".
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20241115204308.3821419-1-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bpf_offload caught a spurious warning in TC recently, but the error
message did not provide enough information to know what the problem
is:
FAIL: Found 'netdevsim' in command output, leaky extack?
Add the extack to the output:
FAIL: Unexpected command output, leaky extack? ('netdevsim', 'Warning: Filter with specified priority/protocol not found.')
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commits for the SMC protocol usually get carried through the netdev
mailing list. Some portions use InfiniBand verbs that are discussed on
the RDMA mailing list. So run patches by that list too to increase the
likelihood that all interested parties can see them.
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
mptcp: pm: lockless list traversal and cleanup
Here are two patches improving the MPTCP in-kernel path-manager.
- Patch 1: the get and dump endpoints operations are iterating over the
endpoints list in a lockless way.
- Patch 2: reduce the code duplication to lookup an endpoint.
====================
Link: https://patch.msgid.link/20241115-net-next-mptcp-pm-lockless-dump-v1-0-f4a1bcb4ca2c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The helper __lookup_addr() can be used in mptcp_pm_nl_get_local_id()
and mptcp_pm_nl_is_backup() to simplify the code, and avoid code
duplication.
Co-developed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241115-net-next-mptcp-pm-lockless-dump-v1-2-f4a1bcb4ca2c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To return an endpoint to the userspace via Netlink, and to dump all of
them, the endpoint list was iterated while holding the pernet->lock, but
only to read the content of the list.
In these cases, the spin locks can be replaced by RCU read ones, and use
the _rcu variants to iterate over the entries list in a lockless way.
Note that the __lookup_addr_by_id() helper has been modified to use the
_rcu variants of list_for_each_entry(), but with an extra conditions, so
it can be called either while the RCU read lock is held, or when the
associated pernet->lock is held.
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241115-net-next-mptcp-pm-lockless-dump-v1-1-f4a1bcb4ca2c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver’s compatibility with devices is confirmed earlier in
platform_match(). Since reaching probe means the device is valid,
the extra check can be removed to simplify the code.
Signed-off-by: Vitalii Mordan <mordan@ispras.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the GPIO interrupt controller is always not working properly, we need
to constantly add workaround to cope with hardware deficiencies. So just
remove GPIO interrupt controller, and let the SFP driver poll the GPIO
status.
Fixes: b4a2496c17ed ("net: txgbe: fix GPIO interrupt blocking")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20241115071527.1129458-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
eth: fbnic: cleanup and add a few stats
Cleanup trival problems with fbnic and add the PCIe and RPC (Rx parser)
stats.
All stats are read under rtnl_lock for now, so the code is pretty
trivial. We'll need to add more locking when we start gathering
drops used by .ndo_get_stats64.
====================
Link: https://patch.msgid.link/20241115015344.757567-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Report Rx parser statistics via ethtool -S.
The parser stats are 32b, so we need to add refresh to the service
task to make sure we don't miss overflows.
Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241115015344.757567-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add PCIe hardware statistics support to the fbnic driver. These stats
provide insight into PCIe transaction performance and error conditions.
Which includes, read/write and completion TLP counts and DWORD counts and
debug counters for tag, completion credit and NP credit exhaustion
The stats are exposed via debugfs and can be used to monitor PCIe
performance and debug PCIe issues.
Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241115015344.757567-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the usual debugfs structure:
fbnic/
$pci-id/
device-fileA
device-fileB
This patch only adds the directories, subsequent changes
will add files.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20241115015344.757567-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While adding the SPDX headers I noticed we're also missing
a header guard.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20241115015344.757567-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo noticed that we are missing SPDX headers, add them.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20241115015344.757567-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sanity checks are going to get silently cast to unsigned
and always pass. Cast the sizeof to signed size.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241115003248.733862-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 51183d233b5a ("net/neighbor: Update neigh_dump_info for strict
data checking") added strict checking. The err variable is not cleared,
so if we find no table to dump we will return the validation error even
if user did not want strict checking.
I think the only way to hit this is to send an buggy request, and ask
for a table which doesn't exist, so there's no point treating this
as a real fix. I only noticed it because a syzbot repro depended on it
to trigger another bug.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241115003221.733593-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since '1 << rocker_port->pport' may be undefined for port >= 32,
cast the left operand to 'unsigned long long' like it's done in
'rocker_port_set_enable()' above. Compile tested only.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://patch.msgid.link/20241114151946.519047-1-dmantipov@yandex.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because optimizing the power consumption of t7XX,
change auto suspend time to 5000.
The Tests uses a script to loop through the power_state
of t7XX.
(for example: /sys/bus/pci/devices/0000\:72\:00.0/power_state)
* If Auto suspend is 20 seconds,
test script show power_state have 0~5% of the time was in D3 state
when host don't have data packet transmission.
* Changed auto suspend time to 5 seconds,
test script show power_state have 50%~80% of the time was in D3 state
when host don't have data packet transmission.
We tested Fibocom FM350 and our products using the t7xx and they all
benefited from this.
Signed-off-by: Jack Wu <wojackbb@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Link: https://patch.msgid.link/20241114102002.481081-1-wojackbb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Binder places its headers under include/uapi/linux/android/
Make sure replace / with _ in the uAPI header guard, the c_upper()
is more strict and only converts - to _. This is likely a good
constraint to have, to enforce sane naming in enums etc.
But paths may include /.
Signed-off-by: Li Li <dualli@google.com>
Link: https://patch.msgid.link/20241113193239.2113577-2-dualli@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"micrel,led-mode" is not yet documented by a schema. It's irrelevant to
the example, so just drop it.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20241113225742.1784723-2-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
AM65 CPSW hardware can map the 6-bit DSCP/TOS field to
appropriate priority queue via DSCP to Priority mapping registers
(CPSW_PN_RX_PRI_MAP_REG).
Use a default DSCP to User Priority (UP) mapping as per
https://datatracker.ietf.org/doc/html/rfc8325#section-4.3
and
https://datatracker.ietf.org/doc/html/rfc8622#section-11
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE802.1Q-2014 supersedes IEEE802.1D-2004. Now Priority Code Point (PCP)
2 is no longer at a lower priority than PCP 0. PCP 1 (Background) is still
at a lower priority than PCP 0 (Best Effort).
Reference:
IEEE802.1Q-2014, Standard for Local and metropolitan area networks
Table I-2 - Traffic type acronyms
Table I-3 - Defining traffic types
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philo Lu says:
====================
udp: Add 4-tuple hash for connected sockets
This patchset introduces 4-tuple hash for connected udp sockets, to make
connected udp lookup faster.
Stress test results (with 1 cpu fully used) are shown below, in pps:
(1) _un-connected_ socket as server
[a] w/o hash4: 1,825176
[b] w/ hash4: 1,831750 (+0.36%)
(2) 500 _connected_ sockets as server
[c] w/o hash4: 290860 (only 16% of [a])
[d] w/ hash4: 1,889658 (+3.1% compared with [b])
With hash4, compute_score is skipped when lookup, so [d] is slightly
better than [b].
Patch1: Add a new counter for hslot2 named hash4_cnt, to avoid cache line
miss when lookup.
Patch2: Add hslot/hlist_nulls for 4-tuple hash.
Patch3 and 4: Implement 4-tuple hash for ipv4 and ipv6.
The detailed motivation is described in Patch 3.
The 4-tuple hash increases the size of udp_sock and udp_hslot. Thus add it
with CONFIG_BASE_SMALL, i.e., it's a no op with CONFIG_BASE_SMALL.
Intentionally, the feature is not available for udplite. Though udplite
shares some structs and functions with udp, its connect() keeps unchanged.
So all udplite sockets perform the same as un-connected udp sockets.
Besides, udplite also shares the additional memory consumption in udp_sock
and udptable.
changelogs:
v8 -> v9 (Paolo Abeni):
- Add explanation about udplite in cover letter
- Update tags for co-developers
- Add acked-by tags of Paolo and Willem
v7 -> v8:
- add EXPORT_SYMBOL for ipv6.ko build
v6 -> v7 (Kuniyuki Iwashima):
- export udp_ehashfn to be used by udpv6 rehash
v5 -> v6 (Paolo Abeni):
- move udp_table_hash4_init from patch2 to patch1
- use hlist_nulls for lookup-rehash race
- add test results in commit log
- add more comment, e.g., for rehash4 used in hash4
- add ipv6 support (Patch4), and refactor some functions for better
sharing, without functionality change
v4 -> v5 (Paolo Abeni):
- add CONFIG_BASE_SMALL with which udp hash4 does nothing
v3 -> v4 (Willem de Bruijn):
- fix mistakes in udp_pernet_table_alloc()
RFCv2 -> v3 (Gur Stavi):
- minor fix in udp_hashslot2() and udp_table_init()
- add rcu sync in rehash4()
RFCv1 -> RFCv2:
- add a new struct for hslot2
- remove the sockopt UDP_HASH4 because it has little side effect for
unconnected sockets
- add rehash in connect()
- re-organize the patch into 3 smaller ones
- other minor fix
v8:
https://lore.kernel.org/all/20241108054836.123484-1-lulie@linux.alibaba.com/
v7:
https://lore.kernel.org/all/20241105121225.12513-1-lulie@linux.alibaba.com/
v6:
https://lore.kernel.org/all/20241031124550.20227-1-lulie@linux.alibaba.com/
v5:
https://lore.kernel.org/all/20241018114535.35712-1-lulie@linux.alibaba.com/
v4:
https://lore.kernel.org/all/20241012012918.70888-1-lulie@linux.alibaba.com/
v3:
https://lore.kernel.org/all/20241010090351.79698-1-lulie@linux.alibaba.com/
RFCv2:
https://lore.kernel.org/all/20240924110414.52618-1-lulie@linux.alibaba.com/
RFCv1:
https://lore.kernel.org/all/20240913100941.8565-1-lulie@linux.alibaba.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement ipv6 udp hash4 like that in ipv4. The major difference is that
the hash value should be calculated with udp6_ehashfn(). Besides,
ipv4-mapped ipv6 address is handled before hash() and rehash(). Export
udp_ehashfn because now we use it in udpv6 rehash.
Core procedures of hash/unhash/rehash are same as ipv4, and udpv4 and
udpv6 share the same udptable, so some functions in ipv4 hash4 can also
be shared.
Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the udp_table has two hash table, the port hash and portaddr
hash. Usually for UDP servers, all sockets have the same local port and
addr, so they are all on the same hash slot within a reuseport group.
In some applications, UDP servers use connect() to manage clients. In
particular, when firstly receiving from an unseen 4 tuple, a new socket
is created and connect()ed to the remote addr:port, and then the fd is
used exclusively by the client.
Once there are connected sks in a reuseport group, udp has to score all
sks in the same hash2 slot to find the best match. This could be
inefficient with a large number of connections, resulting in high
softirq overhead.
To solve the problem, this patch implement 4-tuple hash for connected
udp sockets. During connect(), hash4 slot is updated, as well as a
corresponding counter, hash4_cnt, in hslot2. In __udp4_lib_lookup(),
hslot4 will be searched firstly if the counter is non-zero. Otherwise,
hslot2 is used like before. Note that only connected sockets enter this
hash4 path, while un-connected ones are not affected.
hlist_nulls is used for hash4, because we probably move to another hslot
wrongly when lookup with concurrent rehash. Then we check nulls at the
list end to see if we should restart lookup. Because udp does not use
SLAB_TYPESAFE_BY_RCU, we don't need to touch sk_refcnt when lookup.
Stress test results (with 1 cpu fully used) are shown below, in pps:
(1) _un-connected_ socket as server
[a] w/o hash4: 1,825176
[b] w/ hash4: 1,831750 (+0.36%)
(2) 500 _connected_ sockets as server
[c] w/o hash4: 290860 (only 16% of [a])
[d] w/ hash4: 1,889658 (+3.1% compared with [b])
With hash4, compute_score is skipped when lookup, so [d] is slightly
better than [b].
Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new hash list, hash4, in udp table. It will be used to implement
4-tuple hash for connected udp sockets. This patch adds the hlist to
table, and implements helpers and the initialization. 4-tuple hash is
implemented in the following patch.
hash4 uses hlist_nulls to avoid moving wrongly onto another hlist due to
concurrent rehash, because rehash() can happen with lookup().
Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preparing for udp 4-tuple hash (uhash4 for short).
To implement uhash4 without cache line missing when lookup, hslot2 is
used to record the number of hashed sockets in hslot4. Thus adding a new
struct udp_hslot_main with field hash4_cnt, which is used by hash2. The
new struct is used to avoid doubling the size of udp_hslot.
Before uhash4 lookup, firstly checking hash4_cnt to see if there are
hashed sks in hslot4. Because hslot2 is always used in lookup, there is
no cache line miss.
Related helpers are updated, and use the helpers as possible.
uhash4 is implemented in following patches.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmc3A/gACgkQrB3Eaf9P
W7fNew//XCIhIvFYaQcP2x84T4EYB679NkGlwMATxXgn40+sp7muSwVweynEWNIu
FltfBAwYD/MxD7g519abVPMWXs/iYI5duw3vvqnxmkOoebWLLocg2VoqFIdVXlQw
/hj+1X/oNT4OKcaQAw/FAGRuYvkc90YB/rRG51RwAIR0tyBjRwfUsozMM8QX/zQI
I0cLCgGAf/kylQre+dhvUkMhXaLogMF5v0qzPxhyMBD02JaUpe6+5cdHQcmKOhqa
ksTpySYnIKIHZrLizeFGDZpinaDIph20vGaDvDXpqTYFuwvCQsZczJy02dF4otf2
2dZz6+2La+ZM+WsGIqpALqKCNhr8fOcQxCRH3eGLPBwoXXt5CFAMgJKob8hKuonW
FgJaYMBZOjYbgGah8WbEe/YsWq4y3uRs48pFtY+T5cn7AskNxIvUoLNjSS83Hlqu
PJbveiKsZygig966Q/zUFATYnvj3zEgjVEcSbK6LRyBXL79Njr8l+PZ0Zoz76tc4
bF1Xv0x+lRYmwa9rvOFaeqrP/GTe0xvlitFzuCN7HnXiN8URKnnDY2odkXYzo+Z7
MBbP8wR/CaoiAvdMw74116nAIFOW95LPtvdGJTvlS9jAOt1P7dWQ3/mFKEpItndv
cJjWzI7HKl0+85FcCDw+tmsDWWGbALUyPw96i8UgUcDGyqVKUgA=
=Ioo8
-----END PGP SIGNATURE-----
Merge tag 'ipsec-next-2024-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next-11-15
1) Add support for RFC 9611 per cpu xfrm state handling.
2) Add inbound and outbound xfrm state caches to speed up
state lookups.
3) Convert xfrm to dscp_t. From Guillaume Nault.
4) Fix error handling in build_aevent.
From Everest K.C.
5) Replace strncpy with strscpy_pad in copy_to_user_auth.
From Daniel Yang.
6) Fix an uninitialized symbol during acquire state insertion.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xuan Zhuo says:
====================
virtio-net: support AF_XDP zero copy (tx)
XDP socket(AF_XDP) is an excellent bypass kernel network framework. The zero
copy feature of xsk (XDP socket) needs to be supported by the driver. The
performance of zero copy is very good. mlx5 and intel ixgbe already support
this feature, This patch set allows virtio-net to support xsk's zerocopy xmit
feature.
At present, we have completed some preparation:
1. vq-reset (virtio spec and kernel code)
2. virtio-core premapped dma
3. virtio-net xdp refactor
So it is time for Virtio-Net to complete the support for the XDP Socket
Zerocopy.
Virtio-net can not increase the queue num at will, so xsk shares the queue with
kernel.
This patch set includes some refactor to the virtio-net to let that to support
AF_XDP.
The current configuration sets the virtqueue (vq) to premapped mode,
implying that all buffers submitted to this queue must be mapped ahead
of time. This presents a challenge for the virtnet send queue (sq): the
virtnet driver would be required to keep track of dma information for vq
size * 17, which can be substantial. However, if the premapped mode were
applied on a per-buffer basis, the complexity would be greatly reduced.
With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
skb buffers could remain unmapped.
We can distinguish them by sg_page(sg), When sg_page(sg) is NULL, this
indicates that the driver has performed DMA mapping in advance, allowing
the Virtio core to directly utilize sg_dma_address(sg) without
conducting any internal DMA mapping. Additionally, DMA unmap operations
for this buffer will be bypassed.
ENV: Qemu with vhost-user(polling mode).
Host CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
testpmd> show port stats all
######################## NIC statistics for port 0 ########################
RX-packets: 19531092064 RX-missed: 0 RX-bytes: 1093741155584
RX-errors: 0
RX-nombuf: 0
TX-packets: 5959955552 TX-errors: 0 TX-bytes: 371030645664
Throughput (since last show)
Rx-pps: 8861574 Rx-bps: 3969985208
Tx-pps: 8861493 Tx-bps: 3969962736
############################################################################
testpmd> show port stats all
######################## NIC statistics for port 0 ########################
RX-packets: 68152727 RX-missed: 0 RX-bytes: 3816552712
RX-errors: 0
RX-nombuf: 0
TX-packets: 68114967 TX-errors: 33216 TX-bytes: 3814438152
Throughput (since last show)
Rx-pps: 6333196 Rx-bps: 2837272088
Tx-pps: 6333227 Tx-bps: 2837285936
############################################################################
But AF_XDP consumes more CPU for tx and rx napi(100% and 86%).
====================
Link: https://patch.msgid.link/20241112012928.102478-1-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now, we support AF_XDP(xsk). Add NETDEV_XDP_ACT_XSK_ZEROCOPY to
xdp_features.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20241112012928.102478-14-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If send queue sent some packets, we update the tx timeout
record to prevent the tx timeout.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20241112012928.102478-13-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver's tx napi is very important for XSK. It is responsible for
obtaining data from the XSK queue and sending it out.
At the beginning, we need to trigger tx napi.
virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk
buffer) by the last bits of the pointer.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20241112012928.102478-12-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since xsk's TX queue is consumed by TX NAPI, if sq is bound to xsk, then
we must stop tx napi from being disabled.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20241112012928.102478-11-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because the af-xdp will introduce a new xmit type, so I refactor the
xmit type mechanism first.
We know both xdp_frame and sk_buff are at least 4 bytes aligned.
For the xdp tx, we do not pass any pointer to virtio core as data,
we just need to pass the len of the packet. So we will push len
to the void pointer. We can make sure the pointer is 4 bytes aligned.
And the data structure of AF_XDP also is at least 4 bytes aligned.
So the last two bits of the pointers are free, we can't use these to
distinguish them.
00 for skb
01 for SKB_ORPHAN
10 for XDP
11 for AF-XDP tx
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20241112012928.102478-9-xuanzhuo@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>