Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
cloning an skb, save and restore truesize after pskb_expand_head(). The two
sizes can differ if the allocator decides to service an allocation of the same
size differently (e.g. use a different size class, or pass the
allocation on to KFENCE).
Because truesize is used for bookkeeping (such as sk_wmem_queued), a
modified truesize of a cloned skb may result in corrupt bookkeeping and
relevant warnings (such as in sk_stream_kill_queues()).
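The save-and-restore step can be modelled in userspace C. This is a minimal sketch, not the kernel code: `struct fake_skb`, `fake_expand_head()` and `clone_expand_keep_truesize()` are hypothetical stand-ins, and the allocator's varying usable size is simulated by the `new_usable` parameter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct fake_skb {
    bool cloned;
    unsigned int truesize; /* used for socket memory bookkeeping */
    size_t usable;         /* usable size the allocator reported for the head */
};

/*
 * Models a head reallocation: truesize is adjusted by the difference in
 * usable size. new_usable is a parameter because the allocator may report
 * a different usable size for an allocation of the same requested size.
 */
static void fake_expand_head(struct fake_skb *skb, size_t new_usable)
{
    int delta = (int)new_usable - (int)skb->usable;

    skb->truesize += delta;
    skb->usable = new_usable;
}

/* For a cloned buffer, keep truesize stable across the reallocation. */
static void clone_expand_keep_truesize(struct fake_skb *skb, size_t new_usable)
{
    if (skb->cloned) {
        unsigned int saved = skb->truesize;

        fake_expand_head(skb, new_usable);
        skb->truesize = saved; /* bookkeeping already charged this value */
    }
}
```

With this model, a cloned buffer whose head is reallocated into a larger size class keeps the truesize that was originally charged to the socket.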
Link: https://lkml.kernel.org/r/X9JR/J6dMMOy1obu@elver.google.com
Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210201160420.2826895-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With version 0 of the protocol it was legal to encode the 'Subflow Id' in
the MP_PRIO suboption, to specify which subflow would change its 'Backup'
flag. This has been removed from v1 specification: thus, according to RFC
8684 §3.3.8, the resulting 'Length' for MP_PRIO changed from 4 to 3 bytes.
Current Linux generates / parses MP_PRIO according to the old spec, using
'Length' equal to 4, and hardcoding 1 as 'Subflow Id'; RFC compliance can
improve if we change 'Length' to 3, leaving a 'Nop' after
the MP_PRIO suboption. In this way the kernel will emit and accept *only*
MP_PRIO suboptions that are compliant to version 1 of the MPTCP protocol.
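The v1 wire layout described above (a 3-byte MP_PRIO followed by a 'Nop') can be sketched as follows. This is an illustrative userspace encoder and parser, not the kernel implementation; the macro and function names are made up for the example, while the option kind (30), the MP_PRIO subtype (0x5) and the NOP kind (1) are the standard values.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TCPOPT_NOP     1   /* TCP No-Operation option kind */
#define SKETCH_TCPOPT_MPTCP   30  /* MPTCP option kind */
#define SKETCH_SUB_MP_PRIO    0x5 /* MP_PRIO subtype */
#define SKETCH_MP_PRIO_LEN_V1 3   /* 'Length' in MPTCP v1 */

/* Writes a v1 MP_PRIO plus one NOP of padding; returns bytes written or 0. */
static size_t mp_prio_v1_write(uint8_t *buf, size_t room, int backup)
{
    if (room < 4)
        return 0;
    buf[0] = SKETCH_TCPOPT_MPTCP;
    buf[1] = SKETCH_MP_PRIO_LEN_V1;
    buf[2] = (uint8_t)((SKETCH_SUB_MP_PRIO << 4) | (backup ? 1 : 0));
    buf[3] = SKETCH_TCPOPT_NOP; /* keeps the option area 4-byte aligned */
    return 4;
}

/* Accepts only the v1 form (Length == 3); returns 0 on success, -1 otherwise. */
static int mp_prio_v1_parse(const uint8_t *opt, size_t avail, int *backup)
{
    if (avail < SKETCH_MP_PRIO_LEN_V1)
        return -1;
    if (opt[0] != SKETCH_TCPOPT_MPTCP || opt[1] != SKETCH_MP_PRIO_LEN_V1)
        return -1;
    if ((opt[2] >> 4) != SKETCH_SUB_MP_PRIO)
        return -1;
    *backup = opt[2] & 0x1;
    return 0;
}
```

A suboption carrying the old 'Length' of 4 is rejected by the parser in this sketch, which mirrors the "emit and accept only v1" behaviour stated above.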
unpatched 5.11-rc kernel:
[root@bottarga ~]# tcpdump -tnnr unpatched.pcap | grep prio
reading from file unpatched.pcap, link-type LINUX_SLL (Linux cooked v1)
dropped privs to tcpdump
IP 10.0.3.2.48433 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 4032325513 ecr 1876514270,mptcp prio non-backup id 1,mptcp dss ack 14084896651682217737], length 0
patched 5.11-rc kernel:
[root@bottarga ~]# tcpdump -tnnr patched.pcap | grep prio
reading from file patched.pcap, link-type LINUX_SLL (Linux cooked v1)
dropped privs to tcpdump
IP 10.0.3.2.49735 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 1276737699 ecr 2686399734,mptcp prio non-backup,nop,mptcp dss ack 18433038869082491686], length 0
Changes since v2:
- when accounting for option space, don't increment 'TCPOLEN_MPTCP_PRIO'
and use 'TCPOLEN_MPTCP_PRIO_ALIGN' instead, thanks to Matthieu Baerts.
Changes since v1:
- refactor patch to avoid using 'TCPOLEN_MPTCP_PRIO' with its old value,
thanks to Geliang Tang.
Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Matteo Croce <mcroce@linux.microsoft.com>
Link: https://lore.kernel.org/r/846cdd41e6ad6ec88ef23fee1552ab39c2f5a3d1.1612184361.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
trees.
Current release - regressions:
- ip_tunnel: fix mtu calculation
- mlx5: fix function calculation for page trees
Previous releases - regressions:
- vsock: fix the race conditions in multi-transport support
- neighbour: prevent a dead entry from updating gc_list
- dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
Previous releases - always broken:
- bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
cgroup getsockopt infra when user space is trying to race against
optlen, from Loris Reiff.
- bpf: add missing fput() in BPF inode storage map update helper
- udp: ipv4: manipulate network header of NATed UDP GRO fraglist
- mac80211: fix station rate table updates on assoc
- r8169: work around RTL8125 UDP HW bug
- igc: report speed and duplex as unknown when device is runtime
suspended
- rxrpc: fix deadlock around release of dst cached on udp tunnel"
* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
net: ipa: fix two format specifier errors
net: ipa: use the right accessor in ipa_endpoint_status_skip()
net: ipa: be explicit about endianness
net: ipa: add a missing __iomem attribute
net: ipa: pass correct dma_handle to dma_free_coherent()
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
net: mvpp2: TCAM entry enable should be written after SRAM data
net: lapb: Copy the skb before sending a packet
net/mlx5e: Release skb in case of failure in tc update skb
net/mlx5e: Update max_opened_tc also when channels are closed
net/mlx5: Fix leak upon failure of rule creation
net/mlx5: Fix function calculation for page trees
docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
udp: ipv4: manipulate network header of NATed UDP GRO fraglist
net: ip_tunnel: fix mtu calculation
vsock: fix the race conditions in multi-transport support
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
ibmvnic: device remove has higher precedence over reset
...
sup_multicast_addr is passed to ether_addr_equal for address comparison
which casts the address inputs to u16 leading to an unaligned access.
Aligning the sup_multicast_addr to u16 boundary fixes the issue.
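The effect of the alignment annotation can be shown with a small userspace sketch. The struct names are hypothetical and C11 `alignas` is used here in place of the kernel's own alignment attribute; only the idea (forcing a 6-byte address onto a 2-byte boundary) is taken from the text.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Without an annotation a byte array may start at an odd offset. */
struct priv_unaligned {
    uint8_t flag;
    unsigned char addr[6];
};

/* With the annotation the array is forced onto a 16-bit boundary. */
struct priv_aligned {
    uint8_t flag;
    alignas(uint16_t) unsigned char addr[6];
};

/* True if p may safely be accessed through a uint16_t pointer. */
static int is_u16_aligned(const void *p)
{
    return ((uintptr_t)p % alignof(uint16_t)) == 0;
}
```

A comparison routine that reads the address as 16-bit words is then safe on the second layout even on architectures that fault on unaligned accesses.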
Signed-off-by: Andreas Oetken <andreas.oetken@siemens.com>
Link: https://lore.kernel.org/r/20210202090304.2740471-1-ennoerlangen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot found a WARNING in rds_rdma_extra_size [1] when an RDS_CMSG_RDMA_ARGS
control message is passed with a user-controlled args->nr_local of
0x40001, causing an order >= MAX_ORDER condition.
The value of args->nr_local can be checked against UIO_MAXIOV, which is 0x400.
So for kcalloc(), 0x400 iovecs with sizeof(struct rds_iovec) = 0x10
is the closest limit, with 0x10 left over.
The same check is currently done in rds_cmsg_rdma_args().
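The bound described above can be sketched as a standalone size calculation. This is an assumption-laden model: `SKETCH_UIO_MAXIOV`, `struct sketch_rds_iovec` and `iovec_array_size()` are hypothetical names, the limit 0x400 and the element size 0x10 are the values quoted in the text, and the choice of `-EMSGSIZE` as the error code is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UIO_MAXIOV 0x400 /* limit quoted in the text */

/* Two 64-bit members give the 0x10-byte element size quoted in the text. */
struct sketch_rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/*
 * Returns the number of bytes needed for nr_local iovecs, or a negative
 * errno if the user-controlled count exceeds the limit.
 */
static long iovec_array_size(uint64_t nr_local)
{
    if (nr_local > SKETCH_UIO_MAXIOV)
        return -EMSGSIZE; /* assumed error code for the sketch */
    return (long)(nr_local * sizeof(struct sketch_rds_iovec));
}
```

Rejecting the count before it reaches the allocator keeps the largest request at 0x400 * 0x10 = 0x4000 bytes in this model.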
[1] WARNING: mm/page_alloc.c:5011
[..]
Call Trace:
alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267
alloc_pages include/linux/gfp.h:547 [inline]
kmalloc_order+0x2e/0xb0 mm/slab_common.c:837
kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853
kmalloc_array include/linux/slab.h:592 [inline]
kcalloc include/linux/slab.h:621 [inline]
rds_rdma_extra_size+0xb2/0x3b0 net/rds/rdma.c:568
rds_rm_size net/rds/send.c:928 [inline]
Reported-by: syzbot+1bd2b07f93745fa38425@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Link: https://lore.kernel.org/r/20210201203233.1324704-1-snovitoll@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sending a packet, we will prepend it with an LAPB header.
This modifies the shared parts of a cloned skb, so we should copy the
skb rather than just clone it, before we prepend the header.
In "Documentation/networking/driver.rst" (the 2nd point), it states
that drivers shouldn't modify the shared parts of a cloned skb when
transmitting.
The "dev_queue_xmit_nit" function in "net/core/dev.c", which is called
when an skb is being sent, clones the skb and sends the clone to
AF_PACKET sockets. Because the LAPB drivers first remove a 1-byte
pseudo-header before handing over the skb to us, if we don't copy the
skb before prepending the LAPB header, the first byte of the packets
received on AF_PACKET sockets can be corrupted.
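The corruption can be reproduced with a toy model of shared buffers. Everything here is hypothetical (`struct view` and its helpers are not kernel APIs): a "clone" shares the underlying bytes, a "copy" gets private bytes, and prepending a header writes into the byte just before the current data start.

```c
#include <stddef.h>
#include <string.h>

/* A window onto a byte buffer: data[off .. off+len) is the packet. */
struct view {
    unsigned char *data;
    size_t off;
    size_t len;
};

/* Clone: a second window onto the same bytes. */
static struct view view_clone(const struct view *v)
{
    return *v;
}

/* Copy: a window onto private bytes held in caller-provided storage. */
static struct view view_copy(const struct view *v, unsigned char *storage)
{
    struct view c = *v;

    memcpy(storage, v->data, v->off + v->len);
    c.data = storage;
    return c;
}

/* Remove n bytes from the front (like stripping a pseudo-header). */
static void view_pull(struct view *v, size_t n)
{
    v->off += n;
    v->len -= n;
}

/* Prepend one header byte; this writes into the underlying buffer. */
static void view_push(struct view *v, unsigned char hdr)
{
    v->off -= 1;
    v->len += 1;
    v->data[v->off] = hdr;
}
```

In this model, pushing a header on the copy leaves the tap's first byte intact, while pushing it on the original window overwrites the byte that the earlier clone still exposes.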
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20210201055706.415842-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two fixes:
- station rate tables were not updated correctly
after association, leading to bad configuration
- rtl8723bs (staging) was initializing data incorrectly
after the previous fix and needed to move the init
later
* tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip
mac80211: fix station rate table updates on assoc
====================
Link: https://lore.kernel.org/r/20210202143505.37610-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.
A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.
Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.
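Rewriting a port or address after NAT also requires adjusting the checksum. A common way to do that is an incremental update in the style of RFC 1624; the sketch below shows the arithmetic for a single 16-bit word. It is a userspace illustration with hypothetical function names, works on host-order words for simplicity, and is not the code added by this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t csum_fold32(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over n 16-bit words (for cross-checking). */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~csum_fold32(sum);
}

/* Incremental update: HC' = ~(~HC + ~m + m') for one replaced word. */
static uint16_t csum_update16(uint16_t check, uint16_t old_word,
                              uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~csum_fold32(sum);
}
```

For a translated port the same update would be applied once per changed 16-bit word, so each segment in the list can be fixed up without recomputing the checksum over its payload.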
Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.")
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/1611962007-80092-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev->hard_header_len for a tunnel interface is set only when header_ops
are set too, and it already contains the full overhead of any tunnel
encapsulation. That's why there is no need to count this overhead twice
in the mtu calculation.
Fixes: fdafed459998 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly")
Reported-by: Slava Bacherikov <mail@slava.cc>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are multiple similar bugs implicitly introduced by the
commit c0cfa2d8a788fcf4 ("vsock: add multi-transports support") and
commit 6a2c0962105ae8ce ("vsock: prevent transport modules unloading").
The bug pattern:
[1] vsock_sock.transport pointer is copied to a local variable,
[2] lock_sock() is called,
[3] the local variable is used.
VSOCK multi-transport support introduced the race condition:
vsock_sock.transport value may change between [1] and [2].
Let's copy vsock_sock.transport pointer to local variables after
the lock_sock() call.
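The bug pattern and its fix can be modelled deterministically in userspace. All names below are hypothetical; in particular `model_lock()` is not lock_sock(), it merely simulates "another task changed the transport before we obtained the lock" by applying a pending change at lock time.

```c
#include <stddef.h>

struct transport {
    int id;
};

struct sock_model {
    const struct transport *transport;
    const struct transport *pending; /* change made by a concurrent task */
    int locked;
};

/* Stand-in for taking the socket lock; concurrent changes land before it. */
static void model_lock(struct sock_model *sk)
{
    if (sk->pending) {
        sk->transport = sk->pending;
        sk->pending = NULL;
    }
    sk->locked = 1;
}

static void model_unlock(struct sock_model *sk)
{
    sk->locked = 0;
}

/* Buggy pattern: pointer copied at [1], lock taken at [2], used at [3]. */
static int racy_transport_id(struct sock_model *sk)
{
    const struct transport *t = sk->transport; /* [1] too early */
    int id;

    model_lock(sk);                            /* [2] */
    id = t->id;                                /* [3] may be stale */
    model_unlock(sk);
    return id;
}

/* Fixed pattern: copy the pointer only after the lock is held. */
static int fixed_transport_id(struct sock_model *sk)
{
    const struct transport *t;
    int id;

    model_lock(sk);
    t = sk->transport;
    id = t->id;
    model_unlock(sk);
    return id;
}
```

In the model, the racy variant reports the transport that was current before the lock was taken, while the fixed variant reports the one actually installed.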
Fixes: c0cfa2d8a788fcf4 ("vsock: add multi-transports support")
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Link: https://lore.kernel.org/r/20210201084719.2257066-1-alex.popov@linux.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If register_netdevice() fails after having called cfg80211's
netdev notifier (cfg80211_netdev_notifier_call) it will call
the notifier again with UNREGISTER. This would then lock the
wiphy mutex because we're marked as registered, which causes
a deadlock.
Fix this by separately keeping track of whether or not we're
in the middle of registering to also skip the notifier call
on this unregister.
Reported-by: syzbot+2ae0ca9d7737ad1a62b7@syzkaller.appspotmail.com
Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
Link: https://lore.kernel.org/r/20210201192048.ed8bad436737.I7cae042c44b15f80919a285799a15df467e9d42d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If a peer device doesn't support eSCO 2M we should skip the params that
use it when setting up sync connection since they will always fail.
Signed-off-by: Yu Liu <yudiliu@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Anj Duvnjak reports that the Kodi.tv NFS client is not able to read
video files from a v5.10.11 Linux NFS server.
The new sendpage-based TCP sendto logic was not attentive to non-
zero page_base values. nfsd_splice_read() sets that field when a
READ payload starts in the middle of a page.
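Honouring a non-zero page_base means the first page contributes only the bytes from that offset onward. The following sketch shows the arithmetic under stated assumptions: a 4096-byte page size, and a hypothetical `split_into_pages()` helper that is not part of SUNRPC.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the example */

struct page_chunk {
    size_t page_index; /* which page of the array */
    size_t offset;     /* where the data starts inside that page */
    size_t len;        /* how many bytes to send from that page */
};

/*
 * Splits a payload of len bytes that begins page_base bytes into a page
 * array into per-page chunks. Returns the number of chunks written,
 * at most max.
 */
static size_t split_into_pages(size_t page_base, size_t len,
                               struct page_chunk *out, size_t max)
{
    size_t index = page_base / SKETCH_PAGE_SIZE;
    size_t offset = page_base % SKETCH_PAGE_SIZE;
    size_t n = 0;

    while (len > 0 && n < max) {
        size_t room = SKETCH_PAGE_SIZE - offset;
        size_t take = len < room ? len : room;

        out[n].page_index = index;
        out[n].offset = offset;
        out[n].len = take;
        n++;
        len -= take;
        index++;
        offset = 0; /* only the first page starts mid-page */
    }
    return n;
}
```

Ignoring the offset of the first chunk would send the wrong bytes for exactly the kind of READ that does not start on a page boundary.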
The Linux NFS client rarely emits an NFS READ that is not page-
aligned. All of my testing so far has been with Linux clients, so I
missed this one.
Reported-by: A. Duvnjak <avian@extremenerds.net>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=211471
Fixes: 4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: A. Duvnjak <avian@extremenerds.net>
If the driver uses .sta_add, station entries are only uploaded after the sta
is in assoc state. Fix early station rate table updates by deferring them
until the sta has been uploaded.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210201083324.3134-1-nbd@nbd.name
[use rcu_access_pointer() instead since we won't dereference here]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- SUNRPC: Handle 0 length opaque XDR object data properly
- Fix a layout segment leak in pnfs_layout_process()
- pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
- pNFS/NFSv4: Improve rejection of out-of-order layouts
- pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
* tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Handle 0 length opaque XDR object data properly
SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
pNFS/NFSv4: Improve rejection of out-of-order layouts
pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
The following race condition was detected:
<CPU A, t0> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead.
<CPU B, t1> - Executing: __netif_receive_skb() ->
__netif_receive_skb_core() -> arp_rcv() -> arp_process(). arp_process()
calls __neigh_lookup() which takes a reference on neighbour entry 'n'.
<CPU A, t2> - Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since the reference count increased in
t1, 'n' couldn't be destroyed.
<CPU B, t3> - Moves further along arp_process() and calls
neigh_update() -> __neigh_update() -> neigh_update_gc_list(), which adds
the neighbour entry back to gc_list (neigh_mark_dead() removed it from
gc_list earlier, in t0).
<CPU B, t4> - arp_process() finally calls neigh_release(n), destroying
the neighbour entry.
This leads to 'n' still being part of gc_list, but the actual
neighbour structure has been freed.
The situation can be prevented from happening if we disallow a dead
entry to have any possibility of updating gc_list. This is what the
patch intends to achieve.
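The intended invariant, "a dead entry never touches gc_list again", can be sketched with a tiny model. The struct and function names are hypothetical and list membership is reduced to a boolean; this illustrates the guard, not the kernel's list handling or locking.

```c
#include <stdbool.h>

struct neigh_model {
    bool dead;
    bool on_gc_list;
};

/* Models marking an entry dead: it is also taken off the gc list. */
static void model_mark_dead(struct neigh_model *n)
{
    n->dead = true;
    n->on_gc_list = false;
}

/* Models the gc-list update with the guard in place. */
static void model_update_gc_list(struct neigh_model *n, bool want_on_list)
{
    if (n->dead)
        return; /* a dead entry must not be re-added */
    n->on_gc_list = want_on_list;
}
```

With the guard, the update at t3 becomes a no-op for the entry marked dead at t0, so the final release at t4 cannot leave a freed entry on the list.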
Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes")
Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst
from an incoming packet to get stolen and attached to the UDP socket from
whence it is leaked when that socket is closed.
When a network namespace is removed, the wait for dst records to be cleaned
up happens before the cleanup of the rxrpc and UDP socket, meaning that the
wait never finishes.
Fix this by moving the rxrpc (and, by dependence, the afs) private
per-network namespace registrations to the device group rather than subsys
group. This allows cached rxrpc local endpoints to be cleared and their
UDP sockets closed before we try waiting for the dst records.
The symptom is that lines looking like the following:
unregister_netdevice: waiting for lo to become free
get emitted at regular intervals after running something like the
referenced syzbot test.
Thanks to Vadim for tracking this down and working out the fix.
Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com
Reported-by: Vadim Fedorenko <vfedorenko@novek.ru>
Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We're moving to netlink-only options, so add comments in the bridge's
sysfs files to warn against adding any new sysfs entries.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We decided to stop adding new sysfs bridge options and continue with
netlink only, so remove hosts limit sysfs support.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are use cases for which the existing tagger, based on the NPI
(Node Processor Interface) functionality, is insufficient.
Namely:
- Frames injected through the NPI port bypass the frame analyzer, so no
source address learning is performed, no TSN stream classification,
etc.
- Flow control is not functional over an NPI port (PAUSE frames are
encapsulated in the same Extraction Frame Header as all other frames)
- There can be at most one NPI port configured for an Ocelot switch. But
in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
port is currently either disabled, or operated as a plain user port
(albeit an internally-facing one). Having the ability to configure the
two CPU ports symmetrically could pave the way for e.g. creating a LAG
between them, to increase bandwidth seamlessly for the system.
So there is a desire to have an alternative to the NPI mode. This change
keeps the default tagger for the Seville and Felix switches as "ocelot",
but it can be changed via the following device attribute:
echo ocelot-8021q > /sys/class/net/<dsa-master>/dsa/tagging
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
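The "new protocol on success, old protocol on failure" contract can be sketched as below. This is a hypothetical model, not the DSA API: the enum, `struct switch_model`, the fault-injection field and all function names are invented for the example.

```c
#include <errno.h>

enum tag_proto {
    TAG_OCELOT,
    TAG_OCELOT_8021Q,
};

struct switch_model {
    enum tag_proto proto;
    int fail_setup_of; /* fault injection: protocol whose setup fails, or -1 */
};

/* Hypothetical per-protocol setup; may fail. */
static int model_setup(struct switch_model *sw, enum tag_proto p)
{
    if ((int)p == sw->fail_setup_of)
        return -EIO;
    sw->proto = p;
    return 0;
}

/* Hypothetical teardown of the resources of a protocol (nothing to free here). */
static void model_teardown(struct switch_model *sw, enum tag_proto p)
{
    (void)sw;
    (void)p;
}

/* Leaves the switch on new_proto on success and on the old one on failure. */
static int model_change_tag_protocol(struct switch_model *sw,
                                     enum tag_proto new_proto)
{
    enum tag_proto old = sw->proto;
    int err;

    model_teardown(sw, old);
    err = model_setup(sw, new_proto);
    if (err) {
        model_setup(sw, old); /* roll back to the previous protocol */
        return err;
    }
    return 0;
}
```

A caller can then rely on the reported protocol matching the hardware state regardless of whether the change succeeded.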
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
--- 10.0.0.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setup, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce an unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The existence of dsa_broadcast has generated some confusion in the past:
https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html
So let's document the existing dsa_port_notify and dsa_broadcast
functions and explain when each of them should be used.
Also, in fact, the in-between function has always been there but was
lacking a name, and is the main reason for this patch: dsa_tree_notify.
Refactor dsa_broadcast to use it.
This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
where it really belonged in the first place, but had no companion so it
stood with dsa_port_notify.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The sja1105 implementation can be blind about this, but the felix driver
doesn't do exactly what it's being told, so it needs to know whether it
is a TX or an RX VLAN, so it can install the appropriate type of TCAM
rule.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use cache friendly helpers to better use cpu caches
while reading /proc/net/netstat
Tested on a platform with 256 threads (AMD Rome)
Before: 305 usec spent in netstat_seq_show()
After: 130 usec spent in netstat_seq_show()
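The message does not show the helpers themselves, so the following only illustrates the general idea of a cache-friendly traversal, under the assumption that each CPU's counters are stored contiguously. Both functions compute the same per-field totals; the second walks each CPU's block once instead of revisiting every block for every field. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_FIELDS 8

/* One contiguous block of counters per CPU (assumed layout). */
struct percpu_block {
    uint64_t field[SKETCH_NR_FIELDS];
};

/* Field-major: for every field, touch every CPU's block. */
static void sum_field_major(const struct percpu_block *cpus, size_t nr_cpus,
                            uint64_t *out)
{
    for (size_t f = 0; f < SKETCH_NR_FIELDS; f++) {
        out[f] = 0;
        for (size_t c = 0; c < nr_cpus; c++)
            out[f] += cpus[c].field[f];
    }
}

/* CPU-major: read each CPU's block once, accumulating all fields. */
static void sum_cpu_major(const struct percpu_block *cpus, size_t nr_cpus,
                          uint64_t *out)
{
    for (size_t f = 0; f < SKETCH_NR_FIELDS; f++)
        out[f] = 0;
    for (size_t c = 0; c < nr_cpus; c++)
        for (size_t f = 0; f < SKETCH_NR_FIELDS; f++)
            out[f] += cpus[c].field[f];
}
```

On machines with many CPUs the second order tends to use each fetched cache line more fully, which is the kind of effect the timings above suggest.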
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210128162145.1703601-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The commit 41b14fb8724d ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit e022f0b4a03f ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.
If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if sk_tx_queue_mapping is ever moved out of the no-copy area,
this silently introduces a bug.
Therefore, this patch adds a compile-time check to guard the order of
sock_copy() and sk_tx_queue_clear(), and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it only does the allocation and its callers
initialize the fields.
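
A minimal userspace sketch of such a compile-time check follows. The
struct layout and every toy_* name are hypothetical; only the idea of
asserting that a field stays inside the no-copy area is taken from the
description above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_sock {
	int family;		/* copied by toy_sock_copy() */
	int refcnt;		/* first field of the no-copy area */
	int tx_queue_mapping;	/* must stay inside the no-copy area */
	int rx_queue_mapping;	/* last field of the no-copy area */
	int priority;		/* copied: first field after the area */
};

#define TOY_DONTCOPY_BEGIN offsetof(struct toy_sock, refcnt)
#define TOY_DONTCOPY_END   offsetof(struct toy_sock, priority)

/* Fails the build if someone moves the field out of the no-copy area. */
static_assert(offsetof(struct toy_sock, tx_queue_mapping) >= TOY_DONTCOPY_BEGIN &&
	      offsetof(struct toy_sock, tx_queue_mapping) < TOY_DONTCOPY_END,
	      "tx_queue_mapping must stay in the no-copy area");

/* Copy everything except the bytes in [BEGIN, END). */
static void toy_sock_copy(struct toy_sock *dst, const struct toy_sock *src)
{
	memcpy(dst, src, TOY_DONTCOPY_BEGIN);
	memcpy((char *)dst + TOY_DONTCOPY_END,
	       (const char *)src + TOY_DONTCOPY_END,
	       sizeof(*dst) - TOY_DONTCOPY_END);
}
```

With the assertion in place, a value initialized before the copy survives
it, and reordering the struct in a way that breaks this turns into a build
failure rather than a runtime bug.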
CC: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds csum offload support for the GRE header:
On the TX path in gre_build_header(), when CHECKSUM_PARTIAL is set
for inner proto, it will calculate the csum for outer proto, and
inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
and csum_start/offset will be set for outer proto, and the outer
csum will be offloaded later.
On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
not set for inner proto and the hardware supports csum offload,
CHECKSUM_PARTIAL and csum_start/offset will be set for outer
proto, and outer csum will be offloaded later. Otherwise, it
will do csum for outer proto by calling gso_make_checksum().
Note that SCTP has to do the csum by itself for the non-GSO path in
sctp_packet_pack(), as gre_build_header() can't handle the csum
with CHECKSUM_PARTIAL set for SCTP CRC csum offload.
v1->v2:
- remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
and it will always do checksum for SCTP in sctp_packet_pack()
when it's not a GSO packet.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The NETIF_F_IP|IPV6_CSUM feature flags indicate UDP and TCP csum offload,
while the NETIF_F_HW_CSUM feature flag indicates IP generic csum offload
in HW, which covers not only TCP/UDP csum but also other protocols' csum,
such as GRE's.
However, skb_csum_hwoffload_help() only checks features against
NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
packet and the features don't support NETIF_F_HW_CSUM, but support
NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
to do csum.
This patch supports IP generic csum processing by checking
NETIF_F_HW_CSUM for all protocols, and checking (NETIF_F_IP_CSUM |
NETIF_F_IPV6_CSUM) only for TCP and UDP.
Note that we're using skb->csum_offset to check if it's a TCP/UDP
protocol, which might be fragile. However, as Alex said, for now only a
few L4 protocols request Tx csum offload, so fixing this can wait until
a new protocol comes along with the same csum offset.
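
The resulting decision can be sketched in userspace as follows. The
TOY_F_* bit values are made up for this example (the real NETIF_F_* values
differ) and toy_can_offload_csum() is a hypothetical stand-in, not the
kernel function. The offsets are the positions of the checksum field in
the TCP and UDP headers:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature bits; the real NETIF_F_* values differ. */
#define TOY_F_IP_CSUM   (1u << 0)
#define TOY_F_IPV6_CSUM (1u << 1)
#define TOY_F_HW_CSUM   (1u << 2)

/* Offset of the checksum field within each L4 header, in bytes. */
#define TOY_TCP_CSUM_OFFSET 16
#define TOY_UDP_CSUM_OFFSET 6

/* Returns true when the device can be left to compute the checksum. */
static bool toy_can_offload_csum(unsigned int features, size_t csum_offset)
{
	/* Generic offload handles any protocol. */
	if (features & TOY_F_HW_CSUM)
		return true;

	/* Protocol-specific offload only covers TCP and UDP. */
	if (features & (TOY_F_IP_CSUM | TOY_F_IPV6_CSUM)) {
		switch (csum_offset) {
		case TOY_TCP_CSUM_OFFSET:
		case TOY_UDP_CSUM_OFFSET:
			return true;
		}
	}

	/* Anything else needs a software checksum. */
	return false;
}
```

For example, a GRE checksum (offset 4 in the GRE header) on a device that
only advertises the IP/IPv6 flags falls through to the software path in
this sketch.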
v1->v2:
- do not extend skb->csum_not_inet, but use skb->csum_offset to tell
if it's a UDP/TCP csum packet.
v2->v3:
- add a note in the changelog, as Willem suggested.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-2-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Previously a temporary tasklet structure was initialized on the stack
using DECLARE_TASKLET_OLD() and then copied over and modified. Nothing
else in the kernel seems to use this pattern, so let's just call
tasklet_init() like everyone else.
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-1-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Give offloading drivers the direction of the offloaded ct flow,
this will be used for matches on direction (ct_state +/-rpl).
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It's better to make 'pkt_sk()' inline here, as non-inline functions
shouldn't occur in headers. Besides, this function is simple
enough to be inline.
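
A small sketch of the pattern, using hypothetical toy_* types rather than
the real packet socket structures:

```c
#ifndef TOY_PKT_H
#define TOY_PKT_H

struct toy_sock {
	int type;
};

/* The generic socket is the first member, so the cast below is valid. */
struct toy_packet_sock {
	struct toy_sock sk;
	int ifindex;
};

/*
 * "static inline" lets every file that includes this header use the
 * helper without emitting a separate out-of-line copy per file.
 */
static inline struct toy_packet_sock *toy_pkt_sk(struct toy_sock *sk)
{
	return (struct toy_packet_sock *)sk;
}

#endif /* TOY_PKT_H */
```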
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210127123302.29842-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
for debugfs files.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Some devices, e.g. the RTL8723BS bluetooth part, some USB attached devices,
completely drop from the bus on a system-suspend. These devices will
have their driver unbound and rebound on resume (when the dropping of
the bus gets detected) and will show up as a new HCI after resume.
These devices do not benefit from the suspend / resume handling work done
by the hci_suspend_notifier. At best this unnecessarily adds some time to
the suspend/resume time. But this may also actually cause problems: if the
code doing the driver unbinding runs after the pm-notifier, then the
hci_suspend_notifier code will try to talk to a device which is now in
an uninitialized state.
This commit adds a new HCI_QUIRK_NO_SUSPEND_NOTIFIER quirk which allows
drivers to opt-out of the hci_suspend_notifier when they know beforehand
that their device will be fully re-initialized / reprobed on resume.
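
The opt-out can be sketched as a quirk bit tested at registration time.
The toy_* names below are hypothetical and only model the idea; the real
registration path does more than flip a flag:

```c
#include <stdbool.h>

#define TOY_QUIRK_NO_SUSPEND_NOTIFIER (1UL << 0)

struct toy_hci_dev {
	unsigned long quirks;		/* set by the driver before registering */
	bool suspend_notifier_registered;
};

/* Register the suspend notifier unless the driver opted out. */
static void toy_register_dev(struct toy_hci_dev *hdev)
{
	if (!(hdev->quirks & TOY_QUIRK_NO_SUSPEND_NOTIFIER))
		hdev->suspend_notifier_registered = true;
}

/* Only undo what registration actually did. */
static void toy_unregister_dev(struct toy_hci_dev *hdev)
{
	if (!(hdev->quirks & TOY_QUIRK_NO_SUSPEND_NOTIFIER))
		hdev->suspend_notifier_registered = false;
}
```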
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Validation of messages for get / del of a next hop is the same as the
validation of messages for get of a resilient next hop group bucket will
be. The difference is that the policy for resilient next hop group buckets
is a superset of that used for next-hop get.
It is therefore possible to reuse the code that validates the nhmsg fields,
extracts the next-hop ID, and validates that. To that end, extract from
nh_valid_get_del_req() a helper __nh_valid_get_del_req() that does just
that.
Make the nlh argument const so that the function can be called from the
dump context, which only has a const nlh. Propagate the constness to
nh_valid_get_del_req().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to allow different handling for next-hop tree dumper and for
bucket dumper, parameterize the next-hop tree walker with a callback. Add
rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
dumping.
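
A walker parameterized with a callback might look like the userspace
sketch below. It walks a flat array instead of the kernel's next-hop tree,
and all toy_* names, including the "budget" used to simulate a full dump
buffer, are assumptions made for illustration:

```c
#include <stddef.h>

struct toy_nexthop {
	unsigned int id;
};

typedef int (*toy_nh_cb_t)(const struct toy_nexthop *nh, void *data);

/*
 * Walk the next hops starting at *idx. Stop at the first non-zero
 * callback result and leave *idx at that entry, so that a later call
 * can resume from the same position.
 */
static int toy_walk_nexthops(const struct toy_nexthop *nhs, size_t n,
			     size_t *idx, toy_nh_cb_t cb, void *data)
{
	for (; *idx < n; (*idx)++) {
		int err = cb(&nhs[*idx], data);

		if (err)
			return err;
	}
	return 0;
}

/* Example callback: "dump" entries until the buffer budget runs out. */
struct toy_budget {
	int room;
	int dumped;
};

static int toy_dump_one(const struct toy_nexthop *nh, void *data)
{
	struct toy_budget *b = data;

	(void)nh;
	if (b->room == 0)
		return -1;	/* buffer full: ask the walker to stop */
	b->room--;
	b->dumped++;
	return 0;
}
```

A second dumper can reuse toy_walk_nexthops() with its own callback and
private data, which is the kind of reuse the message describes.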
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extract from rtm_dump_nexthop() a helper to walk the next hop tree. A
separate function for this will be reusable from the bucket dumper.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dump operations need to keep state from one invocation to another. A
scratch area is dedicated for this purpose in the passed-in argument, cb,
namely via two aliased arrays, struct netlink_callback.args and .ctx.
Dumping of buckets will end up having to iterate over next hops as well,
and it would be nice to be able to reuse the iteration logic with the NH
dumper. The fact that the logic currently relies on a fixed index into the
.args array, and the indices would have to be coordinated between the two
dumpers, makes this somewhat awkward.
To make the access patterns clearer, introduce a helper struct with a NH
index, and instead of using the .args array directly, use it through this
structure.
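
The idea of overlaying a typed struct on the scratch area can be sketched
in userspace as follows. struct toy_netlink_callback is a simplified
stand-in for struct netlink_callback, and the toy_* names are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in: args and ctx alias the same scratch storage. */
struct toy_netlink_callback {
	union {
		long args[6];
		uint8_t ctx[48];
	};
};

/* Typed view of the scratch area used by the next-hop dumper. */
struct toy_dump_nh_ctx {
	uint32_t idx;	/* position from which the next invocation resumes */
};

/*
 * Return the dumper's context stored in cb->ctx. The kernel is built
 * with -fno-strict-aliasing, which makes this kind of overlay safe there.
 */
static struct toy_dump_nh_ctx *
toy_dump_nh_ctx_get(struct toy_netlink_callback *cb)
{
	static_assert(sizeof(struct toy_dump_nh_ctx) <= sizeof(cb->ctx),
		      "dump context must fit in the scratch area");
	return (struct toy_dump_nh_ctx *)cb->ctx;
}
```

Callers then write toy_dump_nh_ctx_get(cb)->idx instead of a bare
cb->args[N], so two dumpers no longer need to agree on array indices.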
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. However, they
have different policies. To allow reuse of this code, extract a
policy-agnostic wrapper out of nh_valid_dump_req(), and convert this
function into a thin wrapper around it.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. In order to make
reuse of this code simpler, convert the code to use a single structure with
filtering configuration instead of passing around the parameters one by
one.
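
As an illustration, the sketch below gathers a few filtering parameters in
one structure and passes it by pointer. The field set and the toy_* names
are assumptions for this example and need not match the real structure:

```c
#include <stdbool.h>

/* Hypothetical filter: a zero index means "do not filter on this". */
struct toy_nh_dump_filter {
	unsigned int dev_idx;
	unsigned int master_idx;
	bool group_filter;	/* dump groups only */
};

struct toy_nh {
	unsigned int dev_idx;
	unsigned int master_idx;
	bool is_group;
};

/* Returns true when the next hop should be skipped by the dump. */
static bool toy_nh_dump_filtered(const struct toy_nh *nh,
				 const struct toy_nh_dump_filter *filter)
{
	if (filter->group_filter && !nh->is_group)
		return true;
	if (filter->dev_idx && nh->dev_idx != filter->dev_idx)
		return true;
	if (filter->master_idx && nh->master_idx != filter->master_idx)
		return true;
	return false;
}
```

Adding another criterion then means adding a field, without changing the
signature of every function along the call chain.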
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once there are several next-hop group types, initialization and
finalization of the notifier type need to reflect the actual type.
Transform nh_notifier_grp_info_init() and _fini() to make extending them
easier.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently there are only two types of in-kernel nexthop notification.
The two are distinguished by the 'is_grp' boolean field in 'struct
nh_notifier_info'.
As more notification types are introduced for more next-hop group types, a
boolean is not an easily extensible interface. Instead, convert it to an
enum.
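
The conversion can be pictured with the following sketch. The toy_* enum,
struct, and helper are hypothetical and only show why an enum extends more
easily than a boolean:

```c
#include <stddef.h>

enum toy_nh_notifier_info_type {
	TOY_NH_NOTIFIER_INFO_TYPE_SINGLE,
	TOY_NH_NOTIFIER_INFO_TYPE_GRP,
	/* Further group types can be appended here. */
};

struct toy_nh_notifier_info {
	enum toy_nh_notifier_info_type type;	/* previously: bool is_grp */
	size_t num_nh;				/* only meaningful for groups */
};

/* Number of next hops that a notification describes. */
static size_t toy_nh_count(const struct toy_nh_notifier_info *info)
{
	switch (info->type) {
	case TOY_NH_NOTIFIER_INFO_TYPE_SINGLE:
		return 1;
	case TOY_NH_NOTIFIER_INFO_TYPE_GRP:
		return info->num_nh;
	}
	return 0;	/* unknown type */
}
```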
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most of the code that deals with nexthop groups relies on the fact that the
group is of exactly one well-known type. Currently there is only one type,
"mpath", but as more next-hop group types come, it becomes desirable to
have a central place where the setting is validated. Introduce such a
place in nexthop_create_group(), so that the check is done before the code
that relies on that invariant is invoked.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The values that a next-hop group needs to keep track of depend on the group
type. Introduce a union to separate fields specific to the mpath groups
from fields specific to other group types.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The logic for selecting a path depends on the next-hop group type. Adapt
nexthop_select_path() to dispatch according to the group type.
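
A userspace sketch of dispatching on the group type follows. It assumes a
hash-threshold style selection for the multipath case, keeps the
type-specific field in a union member, and uses hypothetical toy_* names
throughout:

```c
#include <stddef.h>

enum toy_grp_type {
	TOY_GRP_TYPE_MPATH,
	TOY_GRP_TYPE_FUTURE,	/* placeholder for a later group type */
};

struct toy_grp_entry {
	int nh_id;
	union {
		struct {
			unsigned int upper_bound;	/* hash threshold */
		} mpath;
		/* Fields of other group types would become further members. */
	};
};

struct toy_group {
	enum toy_grp_type type;
	size_t num_nh;
	const struct toy_grp_entry *entries;
};

/* Pick the first entry whose upper bound covers the hash. */
static int toy_select_path_mpath(const struct toy_group *g, unsigned int hash)
{
	for (size_t i = 0; i < g->num_nh; i++)
		if (hash <= g->entries[i].mpath.upper_bound)
			return g->entries[i].nh_id;
	return -1;
}

/* Dispatch on the group type; unknown types select nothing. */
static int toy_select_path(const struct toy_group *g, unsigned int hash)
{
	switch (g->type) {
	case TOY_GRP_TYPE_MPATH:
		return toy_select_path_mpath(g, hash);
	default:
		return -1;
	}
}
```

A new group type would add a case to the switch and, if it needs per-entry
state, a member to the union.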
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nexthop_free_mpath really should be nexthop_free_group. Rename it.
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>