64095 Commits

Author SHA1 Message Date
Marco Elver
097b9146c0 net: fix up truesize of cloned skb in skb_prepare_for_shift()
Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
cloning an skb, save and restore truesize after pskb_expand_head(). This
can occur if the allocator decides to service an allocation of the same
size differently (e.g. use a different size class, or pass the
allocation on to KFENCE).

Because truesize is used for bookkeeping (such as sk_wmem_queued), a
modified truesize of a cloned skb may result in corrupt bookkeeping and
relevant warnings (such as in sk_stream_kill_queues()).

Link: https://lkml.kernel.org/r/X9JR/J6dMMOy1obu@elver.google.com
Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210201160420.2826895-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 17:36:12 -08:00
Davide Caratti
ec99a470c7 mptcp: fix length of MP_PRIO suboption
With version 0 of the protocol it was legal to encode the 'Subflow Id' in
the MP_PRIO suboption, to specify which subflow would change its 'Backup'
flag. This has been removed from v1 specification: thus, according to RFC
8684 §3.3.8, the resulting 'Length' for MP_PRIO changed from 4 to 3 byte.

Current Linux generates / parses MP_PRIO according to the old spec, using
'Length' equal to 4, and hardcoding 1 as 'Subflow Id'; RFC compliance can
improve if we change 'Length' in other to become 3, leaving a 'Nop' after
the MP_PRIO suboption. In this way the kernel will emit and accept *only*
MP_PRIO suboptions that are compliant to version 1 of the MPTCP protocol.

 unpatched 5.11-rc kernel:
 [root@bottarga ~]# tcpdump -tnnr unpatched.pcap | grep prio
 reading from file unpatched.pcap, link-type LINUX_SLL (Linux cooked v1)
 dropped privs to tcpdump
 IP 10.0.3.2.48433 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 4032325513 ecr 1876514270,mptcp prio non-backup id 1,mptcp dss ack 14084896651682217737], length 0

 patched 5.11-rc kernel:
 [root@bottarga ~]# tcpdump -tnnr patched.pcap | grep prio
 reading from file patched.pcap, link-type LINUX_SLL (Linux cooked v1)
 dropped privs to tcpdump
 IP 10.0.3.2.49735 > 10.0.1.1.10006: Flags [.], ack 1, win 502, options [nop,nop,TS val 1276737699 ecr 2686399734,mptcp prio non-backup,nop,mptcp dss ack 18433038869082491686], length 0

Changes since v2:
 - when accounting for option space, don't increment 'TCPOLEN_MPTCP_PRIO'
   and use 'TCPOLEN_MPTCP_PRIO_ALIGN' instead, thanks to Matthieu Baerts.
Changes since v1:
 - refactor patch to avoid using 'TCPOLEN_MPTCP_PRIO' with its old value,
   thanks to Geliang Tang.

Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Matteo Croce <mcroce@linux.microsoft.com>
Link: https://lore.kernel.org/r/846cdd41e6ad6ec88ef23fee1552ab39c2f5a3d1.1612184361.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 17:34:02 -08:00
Jakub Kicinski
d1e1355aef Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 14:21:31 -08:00
Linus Torvalds
a992562872 Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
trees.
 
 Current release - regressions:
 
  - ip_tunnel: fix mtu calculation
 
  - mlx5: fix function calculation for page trees
 
 Previous releases - regressions:
 
  - vsock: fix the race conditions in multi-transport support
 
  - neighbour: prevent a dead entry from updating gc_list
 
  - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
 
 Previous releases - always broken:
 
  - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
                 cgroup getsockopt infra when user space is trying
 		to race against optlen, from Loris Reiff.
 
  - bpf: add missing fput() in BPF inode storage map update helper
 
  - udp: ipv4: manipulate network header of NATed UDP GRO fraglist
 
  - mac80211: fix station rate table updates on assoc
 
  - r8169: work around RTL8125 UDP HW bug
 
  - igc: report speed and duplex as unknown when device is runtime
         suspended
 
  - rxrpc: fix deadlock around release of dst cached on udp tunnel
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmAZjwQACgkQMUZtbf5S
 IruLbQ//Yg9+xEnqhDuOJZtYHB0rsJjLlKmtvgOsBr8BaTcUEPoPoqUPm+EMvCHb
 o1fFa1qIrbS5luVEofu9hNX7DGXwvgawaMW2TympJhqLZQqjazCMB/st99LphhJw
 RvaZI8aDOikosT4c+I0vm83jDQETonrjziIcPfHHPjn/Q+amGRRRXiTSQnRF/MlU
 oARCG+U3kHsHBDUPNSCtSjKXshoZPjFb/pD7fQAlzzm7CssvbPhNWbducueyP2Fb
 XW4RwJu9QBBH2JS6uZJ1Y6LVoRzusmE9dUam3KhkiL/CHs72lWPsc+Rn5gbBPvc5
 Y4T4h61Xti1O4ULKdqhGceror6XY+4Qb1VlHWWztOhIo00wIAv3IHbTup/4o0HBr
 j84MtcyOl/qxSFXjunPJkbWJngXikrkIMS0Bl6ZcPAejYM9wN6vCgbvFCHbEg1Rx
 cWFnYyS9FCLduaxHSizv050tWhknOdX+zHK3fOtlW0yWnreJAB8Hoc21Zm7YKvg0
 GxxcGK6AhqJ6s2ixVDv7MyJrltJ/hOJQb+T3HgHFuY2BYUs8F2r/HoHU/u4uCl76
 RdBzbC/sLnBpMHf6r1rHTnGPsapoJOOYWnej71l425vX1qr5xnmxVNNB6HReObNv
 +/jPoRYa5BVsVt2LmDcuH1O32pXJPWKVBR7Yfa6Bn2yzhcbECTc=
 =ZByM
 -----END PGP SIGNATURE-----

Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
  trees.

  Current release - regressions:

   - ip_tunnel: fix mtu calculation

   - mlx5: fix function calculation for page trees

  Previous releases - regressions:

   - vsock: fix the race conditions in multi-transport support

   - neighbour: prevent a dead entry from updating gc_list

   - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add

  Previous releases - always broken:

   - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
     cgroup getsockopt infra when user space is trying to race against
     optlen, from Loris Reiff.

   - bpf: add missing fput() in BPF inode storage map update helper

   - udp: ipv4: manipulate network header of NATed UDP GRO fraglist

   - mac80211: fix station rate table updates on assoc

   - r8169: work around RTL8125 UDP HW bug

   - igc: report speed and duplex as unknown when device is runtime
     suspended

   - rxrpc: fix deadlock around release of dst cached on udp tunnel"

* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
  net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
  net: ipa: fix two format specifier errors
  net: ipa: use the right accessor in ipa_endpoint_status_skip()
  net: ipa: be explicit about endianness
  net: ipa: add a missing __iomem attribute
  net: ipa: pass correct dma_handle to dma_free_coherent()
  r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
  net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
  net: mvpp2: TCAM entry enable should be written after SRAM data
  net: lapb: Copy the skb before sending a packet
  net/mlx5e: Release skb in case of failure in tc update skb
  net/mlx5e: Update max_opened_tc also when channels are closed
  net/mlx5: Fix leak upon failure of rule creation
  net/mlx5: Fix function calculation for page trees
  docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
  udp: ipv4: manipulate network header of NATed UDP GRO fraglist
  net: ip_tunnel: fix mtu calculation
  vsock: fix the race conditions in multi-transport support
  net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
  ibmvnic: device remove has higher precedence over reset
  ...
2021-02-02 10:26:09 -08:00
Andreas Oetken
6c9f18f294 net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
sup_multicast_addr is passed to ether_addr_equal for address comparison
which casts the address inputs to u16 leading to an unaligned access.
Aligning the sup_multicast_addr to u16 boundary fixes the issue.

Signed-off-by: Andreas Oetken <andreas.oetken@siemens.com>
Link: https://lore.kernel.org/r/20210202090304.2740471-1-ennoerlangen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 08:57:18 -08:00
Sabyrzhan Tasbolatov
a11148e6fc net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
syzbot found WARNING in rds_rdma_extra_size [1] when RDS_CMSG_RDMA_ARGS
control message is passed with user-controlled
0x40001 bytes of args->nr_local, causing order >= MAX_ORDER condition.

The exact value 0x40001 can be checked with UIO_MAXIOV which is 0x400.
So for kcalloc() 0x400 iovecs with sizeof(struct rds_iovec) = 0x10
is the closest limit, with 0x10 leftover.

Same condition is currently done in rds_cmsg_rdma_args().

[1] WARNING: mm/page_alloc.c:5011
[..]
Call Trace:
 alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267
 alloc_pages include/linux/gfp.h:547 [inline]
 kmalloc_order+0x2e/0xb0 mm/slab_common.c:837
 kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853
 kmalloc_array include/linux/slab.h:592 [inline]
 kcalloc include/linux/slab.h:621 [inline]
 rds_rdma_extra_size+0xb2/0x3b0 net/rds/rdma.c:568
 rds_rm_size net/rds/send.c:928 [inline]

Reported-by: syzbot+1bd2b07f93745fa38425@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Link: https://lore.kernel.org/r/20210201203233.1324704-1-snovitoll@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 08:44:08 -08:00
Xie He
88c7a9fd9b net: lapb: Copy the skb before sending a packet
When sending a packet, we will prepend it with an LAPB header.
This modifies the shared parts of a cloned skb, so we should copy the
skb rather than just clone it, before we prepend the header.

In "Documentation/networking/driver.rst" (the 2nd point), it states
that drivers shouldn't modify the shared parts of a cloned skb when
transmitting.

The "dev_queue_xmit_nit" function in "net/core/dev.c", which is called
when an skb is being sent, clones the skb and sents the clone to
AF_PACKET sockets. Because the LAPB drivers first remove a 1-byte
pseudo-header before handing over the skb to us, if we don't copy the
skb before prepending the LAPB header, the first byte of the packets
received on AF_PACKET sockets can be corrupted.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20210201055706.415842-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 08:40:48 -08:00
Jakub Kicinski
f418bad6cc Two fixes:
- station rate tables were not updated correctly
    after association, leading to bad configuration
  - rtl8723bs (staging) was initializing data incorrectly
    after the previous fix and needed to move the init
    later
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmAZYesACgkQB8qZga/f
 l8QOpA//V6cBc+ie4AoWKLU/isc33IiZG15qbwvXrHVouLaMgSg4SKQ3cQMwRKd7
 Y6cDJynj4EkIo7RZlRMIo5Dsefm2sv6L41tDhfVwR8Z7+wOmGKSXsmYABpjWttaZ
 4ABEqU2ZUPI3cm5iYF7qKKSNbwSqHeCWjlWnB3TFB8NfzTC+x7uSTQ/8C8/GZ7aF
 tkELHgvdig9+FbdKOs52lmIneTYLEDqxMuv+65XtE9flaYvgWiXCj5ilVVDo8Tjd
 vdCST8ux/9YIEcrhlM+SUM1OFO6AIqZ3EX5S2ZzJdc37PMDDy+nNr95cFrlN4EQ8
 y9avIS0Z+mvw/R7KSDc7XKInpvleC7bzR9DZVQsF8hdV9iB0cmVKyPmASfGpft69
 Ndv2+h2vmWvSHmJDpiroSvTY9WT+AgWCihOU/tj0PrKs+XNLUFfrO08BxFGnaRK/
 +MXzXY7ZmgfU9BFgmlAS2ejRbqfb3V6F5qa2Obj+3gq/SbM9W4Jl8RHiiox7szse
 GdLrT/LjvVEFC/cEMqDzvnGpnVosNkNtJRFMAaGyKs1g/uljl9A51HRZ8HdLrgv9
 bVsMripcQX2JMMxqBwbyfdzPBE0MX8ExkMhyuFbdUyWGEWJqsz4+irr25Bhcyoge
 RaRI6/xPM7DOkB9CDdbvJItBJ9GHYz6gvf+ZiIdu+ClpQ+b3k3s=
 =mEgj
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two fixes:
 - station rate tables were not updated correctly
   after association, leading to bad configuration
 - rtl8723bs (staging) was initializing data incorrectly
   after the previous fix and needed to move the init
   later

* tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
  staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip
  mac80211: fix station rate table updates on assoc
====================

Link: https://lore.kernel.org/r/20210202143505.37610-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 08:37:01 -08:00
Gopal Tiwari
e8bd76ede1 Bluetooth: Fix null pointer dereference in amp_read_loc_assoc_final_data
kernel panic trace looks like:

 #5 [ffffb9e08698fc80] do_page_fault at ffffffffb666e0d7
 #6 [ffffb9e08698fcb0] page_fault at ffffffffb70010fe
    [exception RIP: amp_read_loc_assoc_final_data+63]
    RIP: ffffffffc06ab54f  RSP: ffffb9e08698fd68  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff8c8845a5a000  RCX: 0000000000000004
    RDX: 0000000000000000  RSI: ffff8c8b9153d000  RDI: ffff8c8845a5a000
    RBP: ffffb9e08698fe40   R8: 00000000000330e0   R9: ffffffffc0675c94
    R10: ffffb9e08698fe58  R11: 0000000000000001  R12: ffff8c8b9cbf6200
    R13: 0000000000000000  R14: 0000000000000000  R15: ffff8c8b2026da0b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb9e08698fda8] hci_event_packet at ffffffffc0676904 [bluetooth]
 #8 [ffffb9e08698fe50] hci_rx_work at ffffffffc06629ac [bluetooth]
 #9 [ffffb9e08698fe98] process_one_work at ffffffffb66f95e7

hcon->amp_mgr seems NULL triggered kernel panic in following line inside
function amp_read_loc_assoc_final_data

        set_bit(READ_LOC_AMP_ASSOC_FINAL, &mgr->state);

Fixed by checking NULL for mgr.

Signed-off-by: Gopal Tiwari <gtiwari@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-02-02 17:23:14 +01:00
Dongseok Yi
c3df39ac9b udp: ipv4: manipulate network header of NATed UDP GRO fraglist
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.

A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.

Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.

Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Link: https://lore.kernel.org/r/1611962007-80092-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01 20:02:16 -08:00
Vadim Fedorenko
28e104d002 net: ip_tunnel: fix mtu calculation
dev->hard_header_len for tunnel interface is set only when header_ops
are set too and already contains full overhead of any tunnel encapsulation.
That's why there is not need to use this overhead twice in mtu calc.

Fixes: fdafed459998 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly")
Reported-by: Slava Bacherikov <mail@slava.cc>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01 19:58:23 -08:00
Alexander Popov
c518adafa3 vsock: fix the race conditions in multi-transport support
There are multiple similar bugs implicitly introduced by the
commit c0cfa2d8a788fcf4 ("vsock: add multi-transports support") and
commit 6a2c0962105ae8ce ("vsock: prevent transport modules unloading").

The bug pattern:
 [1] vsock_sock.transport pointer is copied to a local variable,
 [2] lock_sock() is called,
 [3] the local variable is used.
VSOCK multi-transport support introduced the race condition:
vsock_sock.transport value may change between [1] and [2].

Let's copy vsock_sock.transport pointer to local variables after
the lock_sock() call.

Fixes: c0cfa2d8a788fcf4 ("vsock: add multi-transports support")
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Link: https://lore.kernel.org/r/20210201084719.2257066-1-alex.popov@linux.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01 19:54:30 -08:00
Johannes Berg
40c575d1ec cfg80211: fix netdev registration deadlock
If register_netdevice() fails after having called cfg80211's
netdev notifier (cfg80211_netdev_notifier_call) it will call
the notifier again with UNREGISTER. This would then lock the
wiphy mutex because we're marked as registered, which causes
a deadlock.

Fix this by separately keeping track of whether or not we're
in the middle of registering to also skip the notifier call
on this unregister.

Reported-by: syzbot+2ae0ca9d7737ad1a62b7@syzkaller.appspotmail.com
Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
Link: https://lore.kernel.org/r/20210201192048.ed8bad436737.I7cae042c44b15f80919a285799a15df467e9d42d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-02-01 19:30:54 +01:00
Yu Liu
8b1c324c9f Bluetooth: Skip eSCO 2M params when not supported
If a peer device doesn't support eSCO 2M we should skip the params that
use it when setting up sync connection since they will always fail.

Signed-off-by: Yu Liu <yudiliu@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-02-01 17:04:17 +01:00
Chuck Lever
bad4c6eb5e SUNRPC: Fix NFS READs that start at non-page-aligned offsets
Anj Duvnjak reports that the Kodi.tv NFS client is not able to read
video files from a v5.10.11 Linux NFS server.

The new sendpage-based TCP sendto logic was not attentive to non-
zero page_base values. nfsd_splice_read() sets that field when a
READ payload starts in the middle of a page.

The Linux NFS client rarely emits an NFS READ that is not page-
aligned. All of my testing so far has been with Linux clients, so I
missed this one.

Reported-by: A. Duvnjak <avian@extremenerds.net>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=211471
Fixes: 4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: A. Duvnjak <avian@extremenerds.net>
2021-02-01 10:03:51 -05:00
Felix Fietkau
18fe0fae61 mac80211: fix station rate table updates on assoc
If the driver uses .sta_add, station entries are only uploaded after the sta
is in assoc state. Fix early station rate table updates by deferring them
until the sta has been uploaded.

Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210201083324.3134-1-nbd@nbd.name
[use rcu_access_pointer() instead since we won't dereference here]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-02-01 15:07:09 +01:00
Linus Torvalds
c178fae3a9 NFS client bugfixes for Linux 5.11
Highlights include:
 
 Bugfixes:
 - SUNRPC: Handle 0 length opaque XDR object data properly
 - Fix a layout segment leak in pnfs_layout_process()
 - pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
 - pNFS/NFSv4: Improve rejection of out-of-order layouts
 - pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmAW4QgACgkQZwvnipYK
 APJNZw/6AnLawj0kjn7z0Wc2LA0QWxbAVGYGe28gQdy6qiBbuOiFDeH8itKk6m1c
 R6ZPpFHFKYk6+CsNcNws2sz9gBQj7wzDIy3sHenIaiNgY/fWNKDC8woKkJFSUSMl
 GSQ9rkCYwRJu1JxP7r/9gnw/86oUTy/PgMaGdz6CMZJlq9iNa8t2UqMOfmcN8EZ3
 AIewe4fSV5ebfycVz6btdJy8OCwyUfQ1OMilfh+0+5HYlk/xUxr57+AHi9r8w6bq
 3tzIq3imQRgZsPPo/DJo/D4hfeFYX849/Tp+I5ydREWIwREBz2PO8bHNFnDoeoLo
 AJ8mkawvpx+jsHFaAHql6STvY7uTY7qqBqsX2qSCqd6n2VEU0+cnDCY1IcgjcfBR
 ozaYHJQm9ZhHzska3r/aKBQmkth9LIPU6aIMcYtjzC3ywua2vfCBSPRYKES80kIV
 Pzgf5yRZFTEp7jGV9Uhf3Hucm3oIF9WVonDpSPbThdHUUXAYAVK1HZwgWx72HskL
 BEhdaD+zsacv58C1+BE3vlh6A/j/cZAQifTfflgkLE3JE1IiKJwFjH4q6jgLwccx
 kWLopK9Ds+ta+kLtlCuNTsPt7aGUoZZleH1Ghzdkw5Dfv2eEnR3YM6raa294avw4
 DzKE/Rzgv5JuoSJhkWW/PiBZHcxMsv3SK7LTjO2oteFz88olsgo=
 =gLzv
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:

 - SUNRPC: Handle 0 length opaque XDR object data properly

 - Fix a layout segment leak in pnfs_layout_process()

 - pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn

 - pNFS/NFSv4: Improve rejection of out-of-order layouts

 - pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()

* tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Handle 0 length opaque XDR object data properly
  SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
  pNFS/NFSv4: Improve rejection of out-of-order layouts
  pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
  pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
  pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
2021-01-31 11:19:12 -08:00
Chinmay Agarwal
eb4e8fac00 neighbour: Prevent a dead entry from updating gc_list
Following race condition was detected:
<CPU A, t0> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead.

<CPU B, t1> - Executing: __netif_receive_skb() ->
__netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process()
calls __neigh_lookup() which takes a reference on neighbour entry 'n'.

<CPU A, t2> - Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since reference count increased in t2,
'n' couldn't be destroyed.

<CPU B, t3> - Moves further along, arp_process() and calls
neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds
the neighbour entry back in gc_list(neigh_mark_dead(), removed it
earlier in t0 from gc_list)

<CPU B, t4> - arp_process() finally calls neigh_release(n), destroying
the neighbour entry.

This leads to 'n' still being part of gc_list, but the actual
neighbour structure has been freed.

The situation can be prevented from happening if we disallow a dead
entry to have any possibility of updating gc_list. This is what the
patch intends to achieve.

Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes")
Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-30 11:09:07 -08:00
David Howells
5399d52233 rxrpc: Fix deadlock around release of dst cached on udp tunnel
AF_RXRPC sockets use UDP ports in encap mode.  This causes socket and dst
from an incoming packet to get stolen and attached to the UDP socket from
whence it is leaked when that socket is closed.

When a network namespace is removed, the wait for dst records to be cleaned
up happens before the cleanup of the rxrpc and UDP socket, meaning that the
wait never finishes.

Fix this by moving the rxrpc (and, by dependence, the afs) private
per-network namespace registrations to the device group rather than subsys
group.  This allows cached rxrpc local endpoints to be cleared and their
UDP sockets closed before we try waiting for the dst records.

The symptom is that lines looking like the following:

	unregister_netdevice: waiting for lo to become free

get emitted at regular intervals after running something like the
referenced syzbot test.

Thanks to Vadim for tracking this down and work out the fix.

Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com
Reported-by: Vadim Fedorenko <vfedorenko@novek.ru>
Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:38:11 -08:00
Nikolay Aleksandrov
1e16f382ae net: bridge: add warning comments to avoid extending sysfs
We're moving to netlink-only options, so add comments in the bridge's
sysfs files to warn against adding any new sysfs entries.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:33:02 -08:00
Nikolay Aleksandrov
7d0888d52f net: bridge: mcast: drop hosts limit sysfs support
We decided to stop adding new sysfs bridge options and continue with
netlink only, so remove hosts limit sysfs support.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:33:02 -08:00
Vladimir Oltean
7c83a7c539 net: dsa: add a second tagger for Ocelot switches based on tag_8021q
There are use cases for which the existing tagger, based on the NPI
(Node Processor Interface) functionality, is insufficient.

Namely:
- Frames injected through the NPI port bypass the frame analyzer, so no
  source address learning is performed, no TSN stream classification,
  etc.
- Flow control is not functional over an NPI port (PAUSE frames are
  encapsulated in the same Extraction Frame Header as all other frames)
- There can be at most one NPI port configured for an Ocelot switch. But
  in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
  port is currently either disabled, or operated as a plain user port
  (albeit an internally-facing one). Having the ability to configure the
  two CPU ports symmetrically could pave the way for e.g. creating a LAG
  between them, to increase bandwidth seamlessly for the system.

So there is a desire to have an alternative to the NPI mode. This change
keeps the default tagger for the Seville and Felix switches as "ocelot",
but it can be changed via the following device attribute:

echo ocelot-8021q > /sys/class/<dsa-master>/dsa/tagging

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:25:27 -08:00
Vladimir Oltean
53da0ebaad net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot

which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").

It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging

This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.

In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.

Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
  the initial tagging protocol driver. Nonetheless, new drivers should
  report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
  protocol, no later than the .setup() method, as well as destroying
  resources used by the last tagger in use, no earlier than the
  .teardown() method.

For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.

Testing:

$ insmod mscc_felix.ko
[   79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[   79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[   97.261724] libphy: VSC9959 internal MDIO bus: probed
[   97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[   97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[   97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[   97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[   97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[   97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[   98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[   98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[   98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[   98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[   98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[   98.527948] device eno2 entered promiscuous mode
[   98.544755] DSA: tree 0 setup

$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
 -  10.0.0.1 ping statistics  -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms

$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
        #!/bin/bash

        ip link set swp0 down
        ip link set swp1 down
        ip link set swp2 down
        ip link set swp3 down
        ip link set swp5 down
        ip link set eno2 down
        echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
        ip link set eno2 up
        ip link set swp0 up
        ip link set swp1 up
        ip link set swp2 up
        ip link set swp3 up
        ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[  645.544426] mscc_felix 0000:00:00.5: Link is Down
[  645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:24:39 -08:00
Vladimir Oltean
357f203bb3 net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.

So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.

It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).

This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?

One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.

Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.

There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:24:31 -08:00
Vladimir Oltean
886f8e26f5 net: dsa: document the existing switch tree notifiers and add a new one
The existence of dsa_broadcast has generated some confusion in the past:
https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html

So let's document the existing dsa_port_notify and dsa_broadcast
functions and explain when each of them should be used.

Also, in fact, the in-between function has always been there but was
lacking a name, and is the main reason for this patch: dsa_tree_notify.
Refactor dsa_broadcast to use it.

This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
where it really belonged in the first place, but had no companion so it
stood with dsa_port_notify.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:24:30 -08:00
Vladimir Oltean
9c7caf2806 net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN
The sja1105 implementation can be blind about this, but the felix driver
doesn't do exactly what it's being told, so it needs to know whether it
is a TX or an RX VLAN, so it can install the appropriate type of TCAM
rule.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 21:24:30 -08:00
Eric Dumazet
0d6cd689f9 net: proc: speedup /proc/net/netstat
Use cache friendly helpers to better use cpu caches
while reading /proc/net/netstat

Tested on a platform with 256 threads (AMD Rome)

Before: 305 usec spent in netstat_seq_show()
After: 130 usec spent in netstat_seq_show()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210128162145.1703601-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 20:59:53 -08:00
Kuniyuki Iwashima
df610cd916 net: Remove redundant calls of sk_tx_queue_clear().
The commit 41b14fb8724d ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit e022f0b4a03f ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.

If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if we move sk_tx_queue_mapping out of the no copy area, it
introduces a bug unintentionally.

Therefore, this patch adds a compile-time check to take care of the order
of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it does the only allocation and its callers
initialize fields.

CC: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 20:57:25 -08:00
Xin Long
efa1a65c7e ip_gre: add csum offload support for gre header
This patch is to add csum offload support for gre header:

On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set
for inner proto, it will calculate the csum for outer proto, and
inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
and csum_start/offset will be set for outer proto, and the outer
csum will be offloaded later.

On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
not set for inner proto and the hardware supports csum offload,
CHECKSUM_PARTIAL and csum_start/offset will be set for outer
proto, and outer csum will be offloaded later. Otherwise, it
will do csum for outer proto by calling gso_make_checksum().

Note that SCTP has to do the csum by itself for non GSO path in
sctp_packet_pack(), as gre_build_header() can't handle the csum
with CHECKSUM_PARTIAL set for SCTP CRC csum offload.

v1->v2:
  - remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
    and it will always do checksum for SCTP in sctp_packet_pack()
    when it's not a GSO packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 20:39:14 -08:00
Xin Long
62fafcd631 net: support ip generic csum processing in skb_csum_hwoffload_help
NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload
while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload
for HW, which includes not only for TCP/UDP csum, but also for other
protocols' csum like GRE's.

However, in skb_csum_hwoffload_help() it only checks features against
NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
packet and the features doesn't support NETIF_F_HW_CSUM, but supports
NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
to do csum.

This patch is to support ip generic csum processing by checking
NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM |
NETIF_F_IPV6_CSUM) only for TCP and UDP.

Note that we're using skb->csum_offset to check if it's a TCP/UDP
proctol, this might be fragile. However, as Alex said, for now we
only have a few L4 protocols that are requesting Tx csum offload,
we'd better fix this until a new protocol comes with a same csum
offset.

v1->v2:
  - not extend skb->csum_not_inet, but use skb->csum_offset to tell
    if it's an UDP/TCP csum packet.
v2->v3:
  - add a note in the changelog, as Willem suggested.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 20:39:14 -08:00
Emil Renner Berthing
a58745979c net: atm: pppoatm: use new API for wakeup tasklet
This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-2-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:24:05 -08:00
Emil Renner Berthing
a5b88632fc net: atm: pppoatm: use tasklet_init to initialize wakeup tasklet
Previously a temporary tasklet structure was initialized on the stack
using DECLARE_TASKLET_OLD() and then copied over and modified. Nothing
else in the kernel seems to use this pattern, so let's just call
tasklet_init() like everyone else.

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210127173256.13954-1-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:24:05 -08:00
Paul Blakey
941eff5aea net: flow_offload: Add original direction flag to ct_metadata
Give offloading drivers the direction of the offloaded ct flow,
this will be used for matches on direction (ct_state +/-rpl).

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:05:30 -08:00
Paul Blakey
8c85d18ce6 net/sched: cls_flower: Add match on the ct_state reply flag
Add match on the ct_state reply flag.

Example:
$ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est+rpl \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \
  ct_state +trk+est-rpl \
  action mirred egress redirect dev ens1f0_0

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 18:05:30 -08:00
Menglong Dong
8c22475148 net: packet: make pkt_sk() inline
It's better make 'pkt_sk()' inline here, as non-inline function
shouldn't occur in headers. Besides, this function is simple
enough to be inline.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210127123302.29842-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 17:38:55 -08:00
Tomoyuki Matsushita
b8ddc3b14c Bluetooth: fix indentation and alignment reported by checkpatch
Signed-off-by: Tomoyuki Matsushita <xorphitus@fastmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-29 16:51:45 +01:00
Jiapeng Zhong
231ee8bd83 Bluetooth: fix coccicheck warnings debugfs
Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
for debugfs files.

Reported-by: Abaci Robot<abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-29 16:51:35 +01:00
Hans de Goede
219991e6be Bluetooth: Add new HCI_QUIRK_NO_SUSPEND_NOTIFIER quirk
Some devices, e.g. the RTL8723BS bluetooth part, some USB attached devices,
completely drop from the bus on a system-suspend. These devices will
have their driver unbound and rebound on resume (when the dropping of
the bus gets detected) and will show up as a new HCI after resume.

These devices do not benefit from the suspend / resume handling work done
by the hci_suspend_notifier. At best this unnecessarily adds some time to
the suspend/resume time. But this may also actually cause problems, if the
code doing the driver unbinding runs after the pm-notifier then the
hci_suspend_notifier code will try to talk to a device which is now in
an uninitialized state.

This commit adds a new HCI_QUIRK_NO_SUSPEND_NOTIFIER quirk which allows
drivers to opt-out of the hci_suspend_notifier when they know beforehand
that their device will be fully re-initialized / reprobed on resume.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-29 16:37:00 +01:00
Petr Machata
0bccf8ed8a nexthop: Extract a helper for validation of get/del RTNL requests
Validation of messages for get / del of a next hop is the same as will be
validation of messages for get of a resilient next hop group bucket. The
difference is that policy for resilient next hop group buckets is a
superset of that used for next-hop get.

It is therefore possible to reuse the code that validates the nhmsg fields,
extracts the next-hop ID, and validates that. To that end, extract from
nh_valid_get_del_req() a helper __nh_valid_get_del_req() that does just
that.

Make the nlh argument const so that the function can be called from the
dump context, which only has a const nlh. Propagate the constness to
nh_valid_get_del_req().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:54 -08:00
Petr Machata
e948217d25 nexthop: Add a callback parameter to rtm_dump_walk_nexthops()
In order to allow different handling for next-hop tree dumper and for
bucket dumper, parameterize the next-hop tree walker with a callback. Add
rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
dumping.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata
cbee18071e nexthop: Extract a helper for walking the next-hop tree
Extract from rtm_dump_nexthop() a helper to walk the next hop tree. A
separate function for this will be reusable from the bucket dumper.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata
a6fbbaa64c nexthop: Strongly-type context of rtm_dump_nexthop()
The dump operations need to keep state from one invocation to another. A
scratch area is dedicated for this purpose in the passed-in argument, cb,
namely via two aliased arrays, struct netlink_callback.args and .ctx.

Dumping of buckets will end up having to iterate over next hops as well,
and it would be nice to be able to reuse the iteration logic with the NH
dumper. The fact that the logic currently relies on fixed index to the
.args array, and the indices would have to be coordinated between the two
dumpers, makes this somewhat awkward.

To make the access patters clearer, introduce a helper struct with a NH
index, and instead of using the .args array directly, use it through this
structure.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata
b9ebea1276 nexthop: Extract a common helper for parsing dump attributes
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. However, they
have different policies. To allow reuse of this code, extract a
policy-agnostic wrapper out of nh_valid_dump_req(), and convert this
function into a thin wrapper around it.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:53 -08:00
Petr Machata
56450ec6b7 nexthop: Extract dump filtering parameters into a single structure
Requests to dump nexthops have many attributes in common with those that
requests to dump buckets of resilient NH groups will have. In order to make
reuse of this code simpler, convert the code to use a single structure with
filtering configuration instead of passing around the parameters one by
one.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:52 -08:00
Petr Machata
da230501f2 nexthop: Dispatch notifier init()/fini() by group type
After there are several next-hop group types, initialization and
finalization of notifier type needs to reflect the actual type. Transform
nh_notifier_grp_info_init() and _fini() to make extending them easier.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:52 -08:00
Ido Schimmel
09ad6becf5 nexthop: Use enum to encode notification type
Currently there are only two types of in-kernel nexthop notification.
The two are distinguished by the 'is_grp' boolean field in 'struct
nh_notifier_info'.

As more notification types are introduced for more next-hop group types, a
boolean is not an easily extensible interface. Instead, convert it to an
enum.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:52 -08:00
Petr Machata
720ccd9a72 nexthop: Assert the invariant that a NH group is of only one type
Most of the code that deals with nexthop groups relies on the fact that the
group is of exactly one well-known type. Currently there is only one type,
"mpath", but as more next-hop group types come, it becomes desirable to
have a central place where the setting is validated. Introduce such place
into nexthop_create_group(), such that the check is done before the code
that relies on that invariant is invoked.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00
Petr Machata
b9bae61be4 nexthop: Introduce to struct nh_grp_entry a per-type union
The values that a next-hop group needs to keep track of depend on the group
type. Introduce a union to separate fields specific to the mpath groups
from fields specific to other group types.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00
Petr Machata
79bc55e3fe nexthop: Dispatch nexthop_select_path() by group type
The logic for selecting path depends on the next-hop group type. Adapt the
nexthop_select_path() to dispatch according to the group type.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00
David Ahern
5d1f0f09b5 nexthop: Rename nexthop_free_mpath
nexthop_free_mpath really should be nexthop_free_group. Rename it.

Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:49:51 -08:00