Commit Graph

79044 Commits

Author SHA1 Message Date
Philo Lu
1b29a730ef ipv6/udp: Add 4-tuple hash for connected socket
Implement ipv6 udp hash4 like that in ipv4. The major difference is that
the hash value should be calculated with udp6_ehashfn(). Besides,
ipv4-mapped ipv6 address is handled before hash() and rehash(). Export
udp_ehashfn because now we use it in udpv6 rehash.

Core procedures of hash/unhash/rehash are same as ipv4, and udpv4 and
udpv6 share the same udptable, so some functions in ipv4 hash4 can also
be shared.

Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-11-18 11:56:21 +00:00
Philo Lu
78c91ae2c6 ipv4/udp: Add 4-tuple hash for connected socket
Currently, the udp_table has two hash table, the port hash and portaddr
hash. Usually for UDP servers, all sockets have the same local port and
addr, so they are all on the same hash slot within a reuseport group.

In some applications, UDP servers use connect() to manage clients. In
particular, when firstly receiving from an unseen 4 tuple, a new socket
is created and connect()ed to the remote addr:port, and then the fd is
used exclusively by the client.

Once there are connected sks in a reuseport group, udp has to score all
sks in the same hash2 slot to find the best match. This could be
inefficient with a large number of connections, resulting in high
softirq overhead.

To solve the problem, this patch implement 4-tuple hash for connected
udp sockets. During connect(), hash4 slot is updated, as well as a
corresponding counter, hash4_cnt, in hslot2. In __udp4_lib_lookup(),
hslot4 will be searched firstly if the counter is non-zero. Otherwise,
hslot2 is used like before. Note that only connected sockets enter this
hash4 path, while un-connected ones are not affected.

hlist_nulls is used for hash4, because we probably move to another hslot
wrongly when lookup with concurrent rehash. Then we check nulls at the
list end to see if we should restart lookup. Because udp does not use
SLAB_TYPESAFE_BY_RCU, we don't need to touch sk_refcnt when lookup.

Stress test results (with 1 cpu fully used) are shown below, in pps:
(1) _un-connected_ socket as server
    [a] w/o hash4: 1,825176
    [b] w/  hash4: 1,831750 (+0.36%)

(2) 500 _connected_ sockets as server
    [c] w/o hash4:   290860 (only 16% of [a])
    [d] w/  hash4: 1,889658 (+3.1% compared with [b])

With hash4, compute_score is skipped when lookup, so [d] is slightly
better than [b].

Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-11-18 11:56:21 +00:00
Philo Lu
dab78a1745 net/udp: Add 4-tuple hash list basis
Add a new hash list, hash4, in udp table. It will be used to implement
4-tuple hash for connected udp sockets. This patch adds the hlist to
table, and implements helpers and the initialization. 4-tuple hash is
implemented in the following patch.

hash4 uses hlist_nulls to avoid moving wrongly onto another hlist due to
concurrent rehash, because rehash() can happen with lookup().

Co-developed-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Co-developed-by: Fred Chen <fred.cc@alibaba-inc.com>
Signed-off-by: Fred Chen <fred.cc@alibaba-inc.com>
Co-developed-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Yubing Qiu <yubing.qiuyubing@alibaba-inc.com>
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-11-18 11:56:21 +00:00
Philo Lu
accdd51dc7 net/udp: Add a new struct for hash2 slot
Preparing for udp 4-tuple hash (uhash4 for short).

To implement uhash4 without cache line missing when lookup, hslot2 is
used to record the number of hashed sockets in hslot4. Thus adding a new
struct udp_hslot_main with field hash4_cnt, which is used by hash2. The
new struct is used to avoid doubling the size of udp_hslot.

Before uhash4 lookup, firstly checking hash4_cnt to see if there are
hashed sks in hslot4. Because hslot2 is always used in lookup, there is
no cache line miss.

Related helpers are updated, and use the helpers as possible.

uhash4 is implemented in following patches.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-11-18 11:56:21 +00:00
David S. Miller
296a681def ipsec-next-2024-11-15
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmc3A/gACgkQrB3Eaf9P
 W7fNew//XCIhIvFYaQcP2x84T4EYB679NkGlwMATxXgn40+sp7muSwVweynEWNIu
 FltfBAwYD/MxD7g519abVPMWXs/iYI5duw3vvqnxmkOoebWLLocg2VoqFIdVXlQw
 /hj+1X/oNT4OKcaQAw/FAGRuYvkc90YB/rRG51RwAIR0tyBjRwfUsozMM8QX/zQI
 I0cLCgGAf/kylQre+dhvUkMhXaLogMF5v0qzPxhyMBD02JaUpe6+5cdHQcmKOhqa
 ksTpySYnIKIHZrLizeFGDZpinaDIph20vGaDvDXpqTYFuwvCQsZczJy02dF4otf2
 2dZz6+2La+ZM+WsGIqpALqKCNhr8fOcQxCRH3eGLPBwoXXt5CFAMgJKob8hKuonW
 FgJaYMBZOjYbgGah8WbEe/YsWq4y3uRs48pFtY+T5cn7AskNxIvUoLNjSS83Hlqu
 PJbveiKsZygig966Q/zUFATYnvj3zEgjVEcSbK6LRyBXL79Njr8l+PZ0Zoz76tc4
 bF1Xv0x+lRYmwa9rvOFaeqrP/GTe0xvlitFzuCN7HnXiN8URKnnDY2odkXYzo+Z7
 MBbP8wR/CaoiAvdMw74116nAIFOW95LPtvdGJTvlS9jAOt1P7dWQ3/mFKEpItndv
 cJjWzI7HKl0+85FcCDw+tmsDWWGbALUyPw96i8UgUcDGyqVKUgA=
 =Ioo8
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2024-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================

ipsec-next-11-15

1) Add support for RFC 9611 per cpu xfrm state handling.

2) Add inbound and outbound xfrm state caches to speed up
   state lookups.

3) Convert xfrm to dscp_t. From Guillaume Nault.

4) Fix error handling in build_aevent.
   From Everest K.C.

5) Replace strncpy with strscpy_pad in copy_to_user_auth.
   From Daniel Yang.

6) Fix an uninitialized symbol during acquire state insertion.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-11-18 11:52:49 +00:00
Mirsad Todorovac
ff1060813d net/9p/usbg: fix handling of the failed kzalloc() memory allocation
On the linux-next, next-20241108 vanilla kernel, the coccinelle tool gave the
following error report:

./net/9p/trans_usbg.c:912:5-11: ERROR: allocation function on line 911 returns
NULL not ERR_PTR on failure

kzalloc() failure is fixed to handle the NULL return case on the memory exhaustion.

Fixes: a3be076dc1 ("net/9p/usbg: Add new usb gadget function transport")
Cc: Michael Grzeschik <m.grzeschik@pengutronix.de>
Cc: Eric Van Hensbergen <ericvh@kernel.org>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Schoenebeck <linux_oss@crudebyte.com>
Cc: v9fs@lists.linux.dev
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Message-ID: <20241109211840.721226-2-mtodorovac69@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2024-11-16 17:23:19 +09:00
Petr Machata
42575ad5aa ndo_fdb_del: Add a parameter to report whether notification was sent
In a similar fashion to ndo_fdb_add, which was covered in the previous
patch, add the bool *notified argument to ndo_fdb_del. Callees that send a
notification on their own set the flag to true.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/06b1acf4953ef0a5ed153ef1f32d7292044f2be6.1731589511.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 16:39:18 -08:00
Petr Machata
4b42fbc6bd ndo_fdb_add: Add a parameter to report whether notification was sent
Currently when FDB entries are added to or deleted from a VXLAN netdevice,
the VXLAN driver emits one notification, including the VXLAN-specific
attributes. The core however always sends a notification as well, a generic
one. Thus two notifications are unnecessarily sent for these operations. A
similar situation comes up with bridge driver, which also emits
notifications on its own:

 # ip link add name vx type vxlan id 1000 dstport 4789
 # bridge monitor fdb &
 [1] 1981693
 # bridge fdb add de:ad:be:ef:13:37 dev vx self dst 192.0.2.1
 de:ad:be:ef:13:37 dev vx dst 192.0.2.1 self permanent
 de:ad:be:ef:13:37 dev vx self permanent

In order to prevent this duplicity, add a paremeter to ndo_fdb_add,
bool *notified. The flag is primed to false, and if the callee sends a
notification on its own, it sets it to true, thus informing the core that
it should not generate another notification.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/cbf6ae8195e85cbf922f8058ce4eba770f3b71ed.1731589511.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 16:39:18 -08:00
Breno Leitao
6c59f16f17 net: netpoll: flush skb pool during cleanup
The netpoll subsystem maintains a pool of 32 pre-allocated SKBs per
instance, but these SKBs are not freed when the netpoll user is brought
down. This leads to memory waste as these buffers remain allocated but
unused.

Add skb_pool_flush() to properly clean up these SKBs when netconsole is
terminated, improving memory efficiency.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20241114-skb_buffers_v2-v3-2-9be9f52a8b69@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 16:25:34 -08:00
Breno Leitao
221a9c1df7 net: netpoll: Individualize the skb pool
The current implementation of the netpoll system uses a global skb
pool, which can lead to inefficient memory usage and
waste when targets are disabled or no longer in use.

This can result in a significant amount of memory being unnecessarily
allocated and retained, potentially causing performance issues and
limiting the availability of resources for other system components.

Modify the netpoll system to assign a skb pool to each target instead of
using a global one.

This approach allows for more fine-grained control over memory
allocation and deallocation, ensuring that resources are only allocated
and retained as needed.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20241114-skb_buffers_v2-v3-1-9be9f52a8b69@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 16:25:34 -08:00
Dmitry Safonov
e51edeaf35 net/netlink: Correct the comment on netlink message max cap
Since commit d35c99ff77 ("netlink: do not enter direct reclaim from
netlink_dump()") the cap is 32KiB.

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20241113-tcp-md5-diag-prep-v2-5-00a2a7feb1fa@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 16:14:16 -08:00
Felix Maurer
0c0d0f42ff xsk: Free skb when TX metadata options are invalid
When a new skb is allocated for transmitting an xsk descriptor, i.e., for
every non-multibuf descriptor or the first frag of a multibuf descriptor,
but the descriptor is later found to have invalid options set for the TX
metadata, the new skb is never freed. This can leak skbs until the send
buffer is full which makes sending more packets impossible.

Fix this by freeing the skb in the error path if we are currently dealing
with the first frag, i.e., an skb allocated in this iteration of
xsk_build_skb.

Fixes: 48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
Reported-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/edb9b00fb19e680dff5a3350cd7581c5927975a8.1731581697.git.fmaurer@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 14:26:40 -08:00
Jakub Kicinski
8807850697 netfilter pull request 24-11-14
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmc18Y4ACgkQ1w0aZmrP
 KyHKURAAwQxhSDGgEGs5Y5f851kqb36OZST7kLXAdLPv6jJlCl5x6gW9Nxo5NWoI
 inFwp5lGjha7dXbrkVi60BvkoMFcU9AhLs4RmWHBZzs3NtnbCEIlZ9LXfWuKf1rU
 1LhfUrN2UqtYRWzz4mznTW686jdEFg5kgyugI8Ja5RaLiaLQ0DNJS8IxZncYP3a6
 ZrmP5d/LUW/WZ0lRLX7s10k+ar8VartZvKr0wuKZXo8TuzjmDFf6+4l2EYbQN+A6
 tjRIpC/8pEvKhC5bvSea1Irn7+qDvapPkpPzkU5Wg+ftMUv/1ehBIBWPkrD5y8ye
 vpvQIb9Wpiyy6dPG3jtK2Y0IwyKZHf3t6mFWI5y10+GUqbYSuabILYquG5SWAbyZ
 EdWrw5fEP9Na4oeEtQpFrPKgcl20fPaxc3Q2MzpodFzUAeYCMrrxBXcToDf0yvFd
 mghsr6iTdfjJT7fT3prFIIkMalAoX1sp6rjpcP+Nd2SY7Y3nBPaiGSrF75svPbPR
 IUTJaZIgUyoOfimy78fKXMuK63r1+wXO5oDXvP2KpBUetAWEO16IULgD7zx0zIWQ
 vnwBcyiqhBzRqcfDpLxaq/wNZA9eJCFCzqRn7GmqNlEKrrGBE62M19gZnAC2hUB/
 FYfHkGT3SvSDt6im1gyNp0QKn8kSl/2bUbkf29rcl0zuu42WnUw=
 =0/FY
 -----END PGP SIGNATURE-----

Merge tag 'nf-24-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Update .gitignore in selftest to skip conntrack_reverse_clash,
   from Li Zhijian.

2) Fix conntrack_dump_flush return values, from Guan Jing.

3) syzbot found that ipset's bitmap type does not properly checks for
   bitmap's first ip, from Jeongjun Park.

* tag 'nf-24-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: ipset: add missing range check in bitmap_ip_uadt
  selftests: netfilter: Fix missing return values in conntrack_dump_flush
  selftests: netfilter: Add missing gitignore file
====================

Link: https://patch.msgid.link/20241114125723.82229-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 14:24:36 -08:00
Joe Damato
ed7231f56c netdev-genl: Hold rcu_read_lock in napi_set
Hold rcu_read_lock during netdev_nl_napi_set_doit, which calls
napi_by_id and requires rcu_read_lock to be held.

Closes: https://lore.kernel.org/netdev/719083c2-e277-447b-b6ea-ca3acb293a03@redhat.com/
Fixes: 1287c1ae0f ("netdev-genl: Support setting per-NAPI config values")
Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241114175600.18882-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 14:21:00 -08:00
Joe Damato
c53bf100f6 netdev-genl: Hold rcu_read_lock in napi_get
Hold rcu_read_lock in netdev_nl_napi_get_doit, which calls napi_by_id
and is required to be called under rcu_read_lock.

Cc: stable@vger.kernel.org
Fixes: 27f91aaf49 ("netdev-genl: Add netlink framework functions for napi")
Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241114175157.16604-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 14:20:19 -08:00
Jakub Kicinski
6cd663f03f bluetooth-next pull request for net-next:
- btusb: add Foxconn 0xe0fc for Qualcomm WCN785x
  - btmtk: Fix ISO interface handling
  - Add quirk for ATS2851
  - btusb: Add RTL8852BE device 0489:e123
  - ISO: Do not emit LE PA/BIG Create Sync if previous is pending
  - btusb: Add USB HW IDs for MT7920/MT7925
  - btintel_pcie: Add handshake between driver and firmware
  - btintel_pcie: Add recovery mechanism
  - hci_conn: Use disable_delayed_work_sync
  - SCO: Use kref to track lifetime of sco_conn
  - ISO: Use kref to track lifetime of iso_conn
  - btnxpuart: Add GPIO support to power save feature
  - btusb: Add 0x0489:0xe0f3 and 0x13d3:0x3623 for Qualcomm WCN785x
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmc2bBcZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKfnGEACV1YylQ9kJxzyDwyxrZtYG
 3T2I+8qQwIpokyGcYHpMC0kJFAzCbywGxjS3njpOrM0nTuvGpvLAD40Sn+pMHbKb
 3XzNNSixuxgJvr3CyDM0KOua/2nFQZyPe0DXe4D9bzOproIHDyoQkWbVqntLCyXO
 DDwwSF/CcgfZsIf5dfGir6erUBOXzYUUlCL7Q0ap2DNdeYAv6XLWwVSMu7okIFpH
 H/vBcWMWXwNNyEtDIHPmRYut14qEFASTRWsSe7IiIoL2V5VG5BVUALk8rAmoVv10
 IjE5kAmdLBHplhtnDIEg55CHIivlnbOp9d7WMhKkwY+vrPx571uFuwXeCBrg+5cd
 SyekP701TtS5oscT2SazimZTdtS0YLmJgVlhxX8DAIeElO9pEPdJt9CNCafFKmEY
 LMleGXDrH4bnTA1k6nMX2Ky4/oqlSonPbYXZ4GzL5ZMRg9biIkRI3YyRvKLM1plh
 MoO14zXhS184Cf0vSaGSeZ2nqsv7Z0lPkGJxCpGyzkOoA1VnzORl9teik5C6eeCw
 7wOoM+x+aJU8hxyD65DkyPNzLkvrEohRJWx7XMOKZEC1uFvBrJfEc/lb7TH5E+Zd
 PbPG1+x5Y3CAwvOcQzbpVeF0ujL0+KvJF+Y1q7eJ3mB+KoDPBEVbcO1ypvL6kbfW
 ETYrD/fuBiuxQE/XM0Xdlw==
 =niGG
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2024-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - btusb: add Foxconn 0xe0fc for Qualcomm WCN785x
 - btmtk: Fix ISO interface handling
 - Add quirk for ATS2851
 - btusb: Add RTL8852BE device 0489:e123
 - ISO: Do not emit LE PA/BIG Create Sync if previous is pending
 - btusb: Add USB HW IDs for MT7920/MT7925
 - btintel_pcie: Add handshake between driver and firmware
 - btintel_pcie: Add recovery mechanism
 - hci_conn: Use disable_delayed_work_sync
 - SCO: Use kref to track lifetime of sco_conn
 - ISO: Use kref to track lifetime of iso_conn
 - btnxpuart: Add GPIO support to power save feature
 - btusb: Add 0x0489:0xe0f3 and 0x13d3:0x3623 for Qualcomm WCN785x

* tag 'for-net-next-2024-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (51 commits)
  Bluetooth: MGMT: Add initial implementation of MGMT_OP_HCI_CMD_SYNC
  Bluetooth: fix use-after-free in device_for_each_child()
  Bluetooth: btintel: Direct exception event to bluetooth stack
  Bluetooth: hci_core: Fix calling mgmt_device_connected
  Bluetooth: hci_bcm: Use the devm_clk_get_optional() helper
  Bluetooth: ISO: Send BIG Create Sync via hci_sync
  Bluetooth: hci_conn: Remove alloc from critical section
  Bluetooth: ISO: Use kref to track lifetime of iso_conn
  Bluetooth: SCO: Use kref to track lifetime of sco_conn
  Bluetooth: HCI: Add IPC(11) bus type
  Bluetooth: btusb: Add 3 HWIDs for MT7925
  Bluetooth: btusb: Add new VID/PID 0489/e124 for MT7925
  Bluetooth: ISO: Update hci_conn_hash_lookup_big for Broadcast slave
  Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending
  Bluetooth: ISO: Fix matching parent socket for BIS slave
  Bluetooth: ISO: Do not emit LE PA Create Sync if previous is pending
  Bluetooth: btrtl: Decrease HCI_OP_RESET timeout from 10 s to 2 s
  Bluetooth: btbcm: fix missing of_node_put() in btbcm_get_board_name()
  Bluetooth: btusb: Add new VID/PID 0489/e111 for MT7925
  Bluetooth: btmtk: adjust the position to init iso data anchor
  ...
====================

Link: https://patch.msgid.link/20241114214731.1994446-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 14:16:28 -08:00
Jakub Kicinski
26a3beee24 netfilter pull request 24-11-15
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEjF9xRqF1emXiQiqU1w0aZmrPKyEFAmc3S9AACgkQ1w0aZmrP
 KyF7Sg/9GBfCiuuxUqrbigUitY8dJFuCTt+fKxMDfTb6sqU7FgQK/ylqwuW2zikz
 MgyVRXTAMbgD1KU5U+v1VEf5kq8iCU/rpdCC1xMOK9GvbaYQ9l/0cw8PR1jGgmSZ
 P1NWgmpv30IbZ/bQblU9/SbP8sFWg3DLC9lFrqYlLkJjijhfSDTflI6uVVWwt+rn
 9jWqgzf6mUYKAKJ56gFfUW/09jYPkQ5OLYz9CLqvIZLhdYNPGy2GEgldzXkHaVPv
 O65lMjrNojVYfITcinjkVfVVTlcLtQPNG9novclXrsf+qSsov5h/583n0c+7Xh3N
 r+EY1NBzZEcxLloTowJ/iq7xtDHHDG6Rv3BGTMS2JWFxhUDOV3Ks2qj/bIZUkzh5
 /Kl8n4NFbE+f1F3TGOoivZ0CFK1s3jcdIu3RTMXwwa41eiOAt8dPvhckfxTW20kT
 GdIYMNpUC1UVw2a1bxPEw27omB2UF2VADK5vHm97WJ8FBjA1HwPA9afF+PmyNMZ6
 cCOKT225DpXkt2WAMX+bgDyqQN150B05/JrBRdiT5hT5++xJn+heZLvx56L4mPA2
 8Y8NnnXLsyx5pwtE6HKgBOZNXXno2xpE/OrafF5n2zHwiMnF5qeF1+Jwerm8SxUa
 ZTuUS1mAi922IJzksnjRtiVggEA4X9Arq4NRwlIMWunTxybmHnE=
 =Tp49
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-24-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Extended netlink error reporting if nfnetlink attribute parser fails,
   from Donald Hunter.

2) Incorrect request_module() module, from Simon Horman.

3) A series of patches to reduce memory consumption for set element
   transactions.
   Florian Westphal says:

"When doing a flush on a set or mass adding/removing elements from a
set, each element needs to allocate 96 bytes to hold the transactional
state.

In such cases, virtually all the information in struct nft_trans_elem
is the same.

Change nft_trans_elem to a flex-array, i.e. a single nft_trans_elem
can hold multiple set element pointers.

The number of elements that can be stored in one nft_trans_elem is limited
by the slab allocator, this series limits the compaction to at most 62
elements as it caps the reallocation to 2048 bytes of memory."

4) A series of patches to prepare the transition to dscp_t in .flowi_tos.
   From Guillaume Nault.

5) Support for bitwise operations with two source registers,
   from Jeremy Sowden.

* tag 'nf-next-24-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: bitwise: add support for doing AND, OR and XOR directly
  netfilter: bitwise: rename some boolean operation functions
  netfilter: nf_dup4: Convert nf_dup_ipv4_route() to dscp_t.
  netfilter: nft_fib: Convert nft_fib4_eval() to dscp_t.
  netfilter: rpfilter: Convert rpfilter_mt() to dscp_t.
  netfilter: flow_offload: Convert nft_flow_route() to dscp_t.
  netfilter: ipv4: Convert ip_route_me_harder() to dscp_t.
  netfilter: nf_tables: allocate element update information dynamically
  netfilter: nf_tables: switch trans_elem to real flex array
  netfilter: nf_tables: prepare nft audit for set element compaction
  netfilter: nf_tables: prepare for multiple elements in nft_trans_elem structure
  netfilter: nf_tables: add nft_trans_commit_list_add_elem helper
  netfilter: bpf: Pass string literal as format argument of request_module()
  netfilter: nfnetlink: Report extack policy errors for batched ops
====================

Link: https://patch.msgid.link/20241115133207.8907-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15 14:09:21 -08:00
Jeremy Sowden
b0ccf4f53d netfilter: bitwise: add support for doing AND, OR and XOR directly
Hitherto, these operations have been converted in user space to
mask-and-xor operations on one register and two immediate values, and it
is the latter which have been evaluated by the kernel.  We add support
for evaluating these operations directly in kernel space on one register
and either an immediate value or a second register.

Pablo made a few changes to the original patch:

- EINVAL if NFTA_BITWISE_SREG2 is used with fast version.
- Allow _AND,_OR,_XOR with _DATA != sizeof(u32)
- Dump _SREG2 or _DATA with _AND,_OR,_XOR

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-15 12:07:04 +01:00
Jeremy Sowden
a12143e608 netfilter: bitwise: rename some boolean operation functions
In the next patch we add support for doing AND, OR and XOR operations
directly in the kernel, so rename some functions and an enum constant
related to mask-and-xor boolean operations.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-15 11:00:29 +01:00
Guillaume Nault
f0d839c13e netfilter: nf_dup4: Convert nf_dup_ipv4_route() to dscp_t.
Use ip4h_dscp() instead of reading iph->tos directly.

ip4h_dscp() returns a dscp_t value which is temporarily converted back
to __u8 with inet_dscp_to_dsfield(). When converting ->flowi4_tos to
dscp_t in the future, we'll only have to remove that
inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-15 11:00:29 +01:00
Guillaume Nault
f12b67cc7d netfilter: nft_fib: Convert nft_fib4_eval() to dscp_t.
Use ip4h_dscp() instead of reading iph->tos directly.

ip4h_dscp() returns a dscp_t value which is temporarily converted back
to __u8 with inet_dscp_to_dsfield(). When converting ->flowi4_tos to
dscp_t in the future, we'll only have to remove that
inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-15 11:00:29 +01:00
Guillaume Nault
f694ce6de5 netfilter: rpfilter: Convert rpfilter_mt() to dscp_t.
Use ip4h_dscp() instead of reading iph->tos directly.

ip4h_dscp() returns a dscp_t value which is temporarily converted back
to __u8 with inet_dscp_to_dsfield(). When converting ->flowi4_tos to
dscp_t in the future, we'll only have to remove that
inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-15 11:00:29 +01:00
Guillaume Nault
6f9615a6e6 netfilter: flow_offload: Convert nft_flow_route() to dscp_t.
Use ip4h_dscp()instead of reading ip_hdr()->tos directly.

ip4h_dscp() returns a dscp_t value which is temporarily converted back
to __u8 with inet_dscp_to_dsfield(). When converting ->flowi4_tos to
dscp_t in the future, we'll only have to remove that
inet_dscp_to_dsfield() call.

Also, remove the comment about the net/ip.h include file, since it's
now required for the ip4h_dscp() helper too.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-15 11:00:29 +01:00
Guillaume Nault
0608746f95 netfilter: ipv4: Convert ip_route_me_harder() to dscp_t.
Use ip4h_dscp()instead of reading iph->tos directly.

ip4h_dscp() returns a dscp_t value which is temporarily converted back
to __u8 with inet_dscp_to_dsfield(). When converting ->flowi4_tos to
dscp_t in the future, we'll only have to remove that
inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-15 11:00:29 +01:00
Steffen Klassert
a35672819f xfrm: Fix acquire state insertion.
A recent commit jumped over the dst hash computation and
left the symbol uninitialized. Fix this by explicitly
computing the dst hash before it is used.

Fixes: 0045e3d806 ("xfrm: Cache used outbound xfrm states at the policy.")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-11-15 07:25:14 +01:00
Edward Cree
a64499f618 net: ethtool: account for RSS+RXNFC add semantics when checking channel count
In ethtool_check_max_channel(), the new RX count must not only cover the
 max queue indices in RSS indirection tables and RXNFC destinations
 separately, but must also, for RXNFC rules with FLOW_RSS, cover the sum
 of the destination queue and the maximum index in the associated RSS
 context's indirection table, since that is the highest queue that the
 rule can actually deliver traffic to.
It could be argued that the max queue across all custom RSS contexts
 (ethtool_get_max_rss_ctx_channel()) need no longer be considered, since
 any context to which packets can actually be delivered will be targeted
 by some RXNFC rule and its max will thus be allowed for by
 ethtool_get_max_rxnfc_channel().  For simplicity we keep both checks, so
 even RSS contexts unused by any RXNFC rule must fit the channel count.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/43257d375434bef388e36181492aa4c458b88336.1731499022.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14 19:53:42 -08:00
Edward Cree
9e43ad7a1e net: ethtool: only allow set_rxnfc with rss + ring_cookie if driver opts in
Ethtool ntuple filters with FLOW_RSS were originally defined as adding
 the base queue ID (ring_cookie) to the value from the indirection table,
 so that the same table could distribute over more than one set of queues
 when used by different filters.
However, some drivers / hardware ignore the ring_cookie, and simply use
 the indirection table entries as queue IDs directly.  Thus, for drivers
 which have not opted in by setting ethtool_ops.cap_rss_rxnfc_adds to
 declare that they support the original (addition) semantics, reject in
 ethtool_set_rxnfc any filter which combines FLOW_RSS and a nonzero ring.
(For a ring_cookie of zero, both behaviours are equivalent.)
Set the cap bit in sfc, as it is known to support this feature.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
Link: https://patch.msgid.link/cc3da0844083b0e301a33092a6299e4042b65221.1731499022.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14 19:53:41 -08:00
Jakub Kicinski
55c8590129 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCZzZUwAAKCRAraS+Z+3y5
 E54pAP9kim6BVXVngcMBmyAKa1Fr0zLGj/Ds1JB+KFfQ/0v80wD/ebVpoIEoKHs9
 /Xl/3WfN3JzIi9+mqIauENH6DTUQPAo=
 =MWOY
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2024-11-14

We've added 9 non-merge commits during the last 4 day(s) which contain
a total of 3 files changed, 226 insertions(+), 84 deletions(-).

The main changes are:

1) Fixes to bpf_msg_push/pop_data and test_sockmap. The changes has
   dependency on the other changes in the bpf-next/net branch,
   from Zijian Zhang.

2) Drop netns codes from mptcp test. Reuse the common helpers in
   test_progs, from Geliang Tang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  bpf, sockmap: Fix sk_msg_reset_curr
  bpf, sockmap: Several fixes to bpf_msg_pop_data
  bpf, sockmap: Several fixes to bpf_msg_push_data
  selftests/bpf: Add more tests for test_txmsg_push_pop in test_sockmap
  selftests/bpf: Add push/pop checking for msg_verify_data in test_sockmap
  selftests/bpf: Fix total_bytes in msg_loop_rx in test_sockmap
  selftests/bpf: Fix SENDPAGE data logic in test_sockmap
  selftests/bpf: Add txmsg_pass to pull/push/pop in test_sockmap
  selftests/bpf: Drop netns helpers in mptcp
====================

Link: https://patch.msgid.link/20241114202832.3187927-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14 19:08:05 -08:00
Guillaume Nault
dab9c63071 bpf: lwtunnel: Prepare bpf_lwt_xmit_reroute() to future .flowi4_tos conversion.
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the
dscp_t value to __u8 with inet_dscp_to_dsfield().

Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop
the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/8338a12377c44f698a651d1ce357dd92bdf18120.1731064982.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14 19:07:49 -08:00
Guillaume Nault
bfe086be5c bpf: ipv4: Prepare __bpf_redirect_neigh_v4() to future .flowi4_tos conversion.
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the
dscp_t value to __u8 with inet_dscp_to_dsfield().

Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop
the inet_dscp_to_dsfield() call.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/35eacc8955003e434afb1365d404193cc98a9579.1731064982.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14 19:07:48 -08:00
Luiz Augusto von Dentz
827af4787e Bluetooth: MGMT: Add initial implementation of MGMT_OP_HCI_CMD_SYNC
This adds the initial implementation of MGMT_OP_HCI_CMD_SYNC as
documented in mgmt-api (BlueZ tree):

Send HCI command and wait for event Command
===========================================

	Command Code:		0x005B
	Controller Index:	<controller id>
	Command Parameters:	Opcode (2 Octets)
				Event (1 Octet)
				Timeout (1 Octet)
				Parameter Length (2 Octets)
				Parameter (variable)
	Return Parameters:	Response (1-variable Octets)

	This command may be used to send a HCI command and wait for an
	(optional) event.

	The HCI command is specified by the Opcode, any arbitrary is supported
	including vendor commands, but contrary to the like of
	Raw/User channel it is run as an HCI command send by the kernel
	since it uses its command synchronization thus it is possible to wait
	for a specific event as a response.

	Setting event to 0x00 will cause the command to wait for either
	HCI Command Status or HCI Command Complete.

	Timeout is specified in seconds, setting it to 0 will cause the
	default timeout to be used.

	Possible errors:	Failed
				Invalid Parameters

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:41:31 -05:00
Dmitry Antipov
27aabf27fd Bluetooth: fix use-after-free in device_for_each_child()
Syzbot has reported the following KASAN splat:

BUG: KASAN: slab-use-after-free in device_for_each_child+0x18f/0x1a0
Read of size 8 at addr ffff88801f605308 by task kbnepd bnep0/4980

CPU: 0 UID: 0 PID: 4980 Comm: kbnepd bnep0 Not tainted 6.12.0-rc4-00161-gae90f6a6170d #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x100/0x190
 ? device_for_each_child+0x18f/0x1a0
 print_report+0x13a/0x4cb
 ? __virt_addr_valid+0x5e/0x590
 ? __phys_addr+0xc6/0x150
 ? device_for_each_child+0x18f/0x1a0
 kasan_report+0xda/0x110
 ? device_for_each_child+0x18f/0x1a0
 ? __pfx_dev_memalloc_noio+0x10/0x10
 device_for_each_child+0x18f/0x1a0
 ? __pfx_device_for_each_child+0x10/0x10
 pm_runtime_set_memalloc_noio+0xf2/0x180
 netdev_unregister_kobject+0x1ed/0x270
 unregister_netdevice_many_notify+0x123c/0x1d80
 ? __mutex_trylock_common+0xde/0x250
 ? __pfx_unregister_netdevice_many_notify+0x10/0x10
 ? trace_contention_end+0xe6/0x140
 ? __mutex_lock+0x4e7/0x8f0
 ? __pfx_lock_acquire.part.0+0x10/0x10
 ? rcu_is_watching+0x12/0xc0
 ? unregister_netdev+0x12/0x30
 unregister_netdevice_queue+0x30d/0x3f0
 ? __pfx_unregister_netdevice_queue+0x10/0x10
 ? __pfx_down_write+0x10/0x10
 unregister_netdev+0x1c/0x30
 bnep_session+0x1fb3/0x2ab0
 ? __pfx_bnep_session+0x10/0x10
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_woken_wake_function+0x10/0x10
 ? __kthread_parkme+0x132/0x200
 ? __pfx_bnep_session+0x10/0x10
 ? kthread+0x13a/0x370
 ? __pfx_bnep_session+0x10/0x10
 kthread+0x2b7/0x370
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x48/0x80
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Allocated by task 4974:
 kasan_save_stack+0x30/0x50
 kasan_save_track+0x14/0x30
 __kasan_kmalloc+0xaa/0xb0
 __kmalloc_noprof+0x1d1/0x440
 hci_alloc_dev_priv+0x1d/0x2820
 __vhci_create_device+0xef/0x7d0
 vhci_write+0x2c7/0x480
 vfs_write+0x6a0/0xfc0
 ksys_write+0x12f/0x260
 do_syscall_64+0xc7/0x250
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 4979:
 kasan_save_stack+0x30/0x50
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 __kasan_slab_free+0x4f/0x70
 kfree+0x141/0x490
 hci_release_dev+0x4d9/0x600
 bt_host_release+0x6a/0xb0
 device_release+0xa4/0x240
 kobject_put+0x1ec/0x5a0
 put_device+0x1f/0x30
 vhci_release+0x81/0xf0
 __fput+0x3f6/0xb30
 task_work_run+0x151/0x250
 do_exit+0xa79/0x2c30
 do_group_exit+0xd5/0x2a0
 get_signal+0x1fcd/0x2210
 arch_do_signal_or_restart+0x93/0x780
 syscall_exit_to_user_mode+0x140/0x290
 do_syscall_64+0xd4/0x250
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

In 'hci_conn_del_sysfs()', 'device_unregister()' may be called when
an underlying (kobject) reference counter is greater than 1. This
means that reparenting (happened when the device is actually freed)
is delayed and, during that delay, parent controller device (hciX)
may be deleted. Since the latter may create a dangling pointer to
freed parent, avoid that scenario by reparenting to NULL explicitly.

Reported-by: syzbot+6cf5652d3df49fae2e3f@syzkaller.appspotmail.com
Tested-by: syzbot+6cf5652d3df49fae2e3f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6cf5652d3df49fae2e3f
Fixes: a85fb91e3d ("Bluetooth: Fix double free in hci_conn_cleanup")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:41:13 -05:00
Luiz Augusto von Dentz
55abbd148d Bluetooth: hci_core: Fix calling mgmt_device_connected
Since 61a939c68e ("Bluetooth: Queue incoming ACL data until
BT_CONNECTED state is reached") there is no long the need to call
mgmt_device_connected as ACL data will be queued until BT_CONNECTED
state.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219458
Link: https://github.com/bluez/bluez/issues/1014
Fixes: 333b4fd11e ("Bluetooth: L2CAP: Fix uaf in l2cap_connect")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:40:35 -05:00
Iulia Tanasescu
07a9342b94 Bluetooth: ISO: Send BIG Create Sync via hci_sync
Before issuing the LE BIG Create Sync command, an available BIG handle
is chosen by iterating through the conn_hash list and finding the first
unused value.

If a BIG is terminated, the associated hcons are removed from the list
and the LE BIG Terminate Sync command is sent via hci_sync queue.
However, a new LE BIG Create sync command might be issued via
hci_send_cmd, before the previous BIG sync was terminated. This
can cause the same BIG handle to be reused and the LE BIG Create Sync
to fail with Command Disallowed.

< HCI Command: LE Broadcast Isochronous Group Create Sync (0x08|0x006b)
        BIG Handle: 0x00
        BIG Sync Handle: 0x0002
        Encryption: Unencrypted (0x00)
        Broadcast Code[16]: 00000000000000000000000000000000
        Maximum Number Subevents: 0x00
        Timeout: 20000 ms (0x07d0)
        Number of BIS: 1
        BIS ID: 0x01
> HCI Event: Command Status (0x0f) plen 4
      LE Broadcast Isochronous Group Create Sync (0x08|0x006b) ncmd 1
        Status: Command Disallowed (0x0c)
< HCI Command: LE Broadcast Isochronous Group Terminate Sync (0x08|0x006c)
        BIG Handle: 0x00

This commit fixes the ordering of the LE BIG Create Sync/LE BIG Terminate
Sync commands, to make sure that either the previous BIG sync is
terminated before reusing the handle, or that a new handle is chosen
for a new sync.

Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:39:59 -05:00
Iulia Tanasescu
25ab2db3e6 Bluetooth: hci_conn: Remove alloc from critical section
This removes the kzalloc memory allocation inside critical section in
create_pa_sync, fixing the following message that appears when the kernel
is compiled with CONFIG_DEBUG_ATOMIC_SLEEP enabled:

BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:321

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:39:40 -05:00
Luiz Augusto von Dentz
dc26097bdb Bluetooth: ISO: Use kref to track lifetime of iso_conn
This make use of kref to keep track of reference of iso_conn which
allows better tracking of its lifetime with usage of things like
kref_get_unless_zero in a similar way as used in l2cap_chan.

In addition to it remove call to iso_sock_set_timer on iso_sock_disconn
since at that point it is useless to set a timer as the sk will be freed
there is nothing to be done in iso_sock_timeout.

Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:39:22 -05:00
Luiz Augusto von Dentz
e6720779ae Bluetooth: SCO: Use kref to track lifetime of sco_conn
This make use of kref to keep track of reference of sco_conn which
allows better tracking of its lifetime with usage of things like
kref_get_unless_zero in a similar way as used in l2cap_chan.

In addition to it remove call to sco_sock_set_timer on __sco_sock_close
since at that point it is useless to set a timer as the sk will be freed
there is nothing to be done in sco_sock_timeout.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:39:03 -05:00
Iulia Tanasescu
83d328a72e Bluetooth: ISO: Update hci_conn_hash_lookup_big for Broadcast slave
Currently, hci_conn_hash_lookup_big only checks for BIS master connections,
by filtering out connections with the destination address set. This commit
updates this function to also consider BIS slave connections, since it is
also used for a Broadcast Receiver to set an available BIG handle before
issuing the LE BIG Create Sync command.

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:37:31 -05:00
Iulia Tanasescu
42ecf19471 Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending
The Bluetooth Core spec does not allow a LE BIG Create sync command to be
sent to Controller if another one is pending (Vol 4, Part E, page 2586).

In order to avoid this issue, the HCI_CONN_CREATE_BIG_SYNC was added
to mark that the LE BIG Create Sync command has been sent for a hcon.
Once the BIG Sync Established event is received, the hcon flag is
erased and the next pending hcon is handled.

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:37:02 -05:00
Iulia Tanasescu
79321b06a0 Bluetooth: ISO: Fix matching parent socket for BIS slave
Currently, when a BIS slave connection is notified to the
ISO layer, the parent socket is tried to be matched by the
HCI_EVT_LE_BIG_SYNC_ESTABILISHED event. However, a BIS slave
connection is notified to the ISO layer after the Command
Complete for the LE Setup ISO Data Path command is received.
This causes the parent to be incorrectly matched if multiple
listen sockets are present.

This commit adds a fix by matching the parent based on the
BIG handle set in the notified connection.

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:36:43 -05:00
Iulia Tanasescu
4a5e0ba686 Bluetooth: ISO: Do not emit LE PA Create Sync if previous is pending
The Bluetooth Core spec does not allow a LE PA Create sync command to be
sent to Controller if another one is pending (Vol 4, Part E, page 2493).

In order to avoid this issue, the HCI_CONN_CREATE_PA_SYNC was added
to mark that the LE PA Create Sync command has been sent for a hcon.
Once the PA Sync Established event is received, the hcon flag is
erased and the next pending hcon is handled.

Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:36:15 -05:00
Danil Pylaev
5bd3135924 Bluetooth: Support new quirks for ATS2851
This adds support for quirks for broken extended create connection,
and write auth payload timeout.

Signed-off-by: Danil Pylaev <danstiv404@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:33:57 -05:00
Andrej Shadura
5fe6caa62b Bluetooth: Fix type of len in rfcomm_sock_getsockopt{,_old}()
Commit 9bf4e919cc worked around an issue introduced after an innocuous
optimisation change in LLVM main:

> len is defined as an 'int' because it is assigned from
> '__user int *optlen'. However, it is clamped against the result of
> sizeof(), which has a type of 'size_t' ('unsigned long' for 64-bit
> platforms). This is done with min_t() because min() requires compatible
> types, which results in both len and the result of sizeof() being casted
> to 'unsigned int', meaning len changes signs and the result of sizeof()
> is truncated. From there, len is passed to copy_to_user(), which has a
> third parameter type of 'unsigned long', so it is widened and changes
> signs again. This excessive casting in combination with the KCSAN
> instrumentation causes LLVM to fail to eliminate the __bad_copy_from()
> call, failing the build.

The same issue occurs in rfcomm in functions rfcomm_sock_getsockopt and
rfcomm_sock_getsockopt_old.

Change the type of len to size_t in both rfcomm_sock_getsockopt and
rfcomm_sock_getsockopt_old and replace min_t() with min().

Cc: stable@vger.kernel.org
Co-authored-by: Aleksei Vetrov <vvvvvv@google.com>
Improves: 9bf4e919cc ("Bluetooth: Fix type of len in {l2cap,sco}_sock_getsockopt_old()")
Link: https://github.com/ClangBuiltLinux/linux/issues/2007
Link: https://github.com/llvm/llvm-project/issues/85647
Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:32:47 -05:00
Luiz Augusto von Dentz
59437cbb57 Bluetooth: hci_core: Fix not checking skb length on hci_scodata_packet
This fixes not checking if skb really contains an SCO header otherwise
the code may attempt to access some uninitilized/invalid memory past the
valid skb->data.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:30:14 -05:00
Luiz Augusto von Dentz
3fe288a821 Bluetooth: hci_core: Fix not checking skb length on hci_acldata_packet
This fixes not checking if skb really contains an ACL header otherwise
the code may attempt to access some uninitilized/invalid memory past the
valid skb->data.

Reported-by: syzbot+6ea290ba76d8c1eb1ac2@syzkaller.appspotmail.com
Tested-by: syzbot+6ea290ba76d8c1eb1ac2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6ea290ba76d8c1eb1ac2
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:29:54 -05:00
Luiz Augusto von Dentz
2b0f2fc9ed Bluetooth: hci_conn: Use disable_delayed_work_sync
This makes use of disable_delayed_work_sync instead
cancel_delayed_work_sync as it not only cancel the ongoing work but also
disables new submit which is disarable since the object holding the work
is about to be freed.

Reported-by: syzbot+2446dd3cb07277388db6@syzkaller.appspotmail.com
Tested-by: syzbot+2446dd3cb07277388db6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2446dd3cb07277388db6
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:29:02 -05:00
Markus Elfring
d96b543c6f Bluetooth: hci_conn: Reduce hci_conn_drop() calls in two functions
An hci_conn_drop() call was immediately used after a null pointer check
for an hci_conn_link() call in two function implementations.
Thus call such a function only once instead directly before the checks.

This issue was transformed by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-14 15:27:47 -05:00
Jakub Kicinski
a79993b5fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.12-rc8).

Conflicts:

tools/testing/selftests/net/.gitignore
  252e01e682 ("selftests: net: add netlink-dumps to .gitignore")
  be43a6b238 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw")
https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/

drivers/net/phy/phylink.c
  671154f174 ("net: phylink: ensure PHY momentary link-fails are handled")
  7530ea26c8 ("net: phylink: remove "using_mac_select_pcs"")

Adjacent changes:

drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c
  5b366eae71 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines")
  e96321fad3 ("net: ethernet: Switch back to struct platform_driver::remove()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14 11:29:15 -08:00
Linus Torvalds
cfaaa7d010 Including fixes from bluetooth.
Quite calm week. No new regression under investigation.
 
 Current release - regressions:
 
   - eth: revert "igb: Disable threaded IRQ for igb_msix_other"
 
 Current release - new code bugs:
 
   - bluetooth: btintel: direct exception event to bluetooth stack
 
 Previous releases - regressions:
 
   - core: fix data-races around sk->sk_forward_alloc
 
   - netlink: terminate outstanding dump on socket close
 
   - mptcp: error out earlier on disconnect
 
   - vsock: fix accept_queue memory leak
 
   - phylink: ensure PHY momentary link-fails are handled
 
   - eth: mlx5:
     - fix null-ptr-deref in add rule err flow
     - lock FTE when checking if active
 
   - eth: dwmac-mediatek: fix inverted handling of mediatek,mac-wol
 
 Previous releases - always broken:
 
   - sched: fix u32's systematic failure to free IDR entries for hnodes.
 
   - sctp: fix possible UAF in sctp_v6_available()
 
   - eth: bonding: add ns target multicast address to slave device
 
   - eth: mlx5: fix msix vectors to respect platform limit
 
   - eth: icssg-prueth: fix 1 PPS sync
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmc174YSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgKIP/3Lk1byZ0dKvxsvyBBDeF7yBOsKRsjLt
 XfrkcxkS/nloDkh8hM8gLiXjzHuHSo8p7YQ8eaZ215FAozkQbTHnyVUiokDY4vqz
 VwCqcHZBTCVZNntOK//lP20wE/FDPrLrRIAflshXHuJv+GBZDKUrjBiyiWhyXltv
 slcj7pW9mQyk/AaRW2n3jF985mBxgSXzNI1agDonq/+yP2R35GMO+jIqJHZ9CLH3
 GZakZs6ZVWqKbk3/U9qhH9nZsVwBt18eqwkaFYszOc8eMlSp0j9yLmdPfbYcLjbe
 tIu/wTF70iHlgw/fbPMWA6dsaf/vN9U96qG3YRH+zwvWUGFYcq/gRSeXceI6/N5u
 EAn8Y1IKXiCdCLd1iRyYZqRhHhnpCkbnx9TURdsCclbFW9bf+BU0MjEP3xfq84sD
 gbO0RXg4ZS2uUFC4EdNkKIMyqLkMcwQMkioGlUM14oXpU0mQDh3BQrS6yrOvH3d6
 YewK7viNYpUlRt54ISTSFSVDff0AAHIWSlNOdH5xLD6YosA+aCJk6icTlmINlx1a
 +ccPDY+dH0Dzwx9n0L6hPodVZeax1elnYLlhkgEFgh8v9Tz8TDjCAN2iI/R1A+QJ
 r80OZG5qXY89BsCvBXz35svDnFucDkIMupVW88kbfgWeRZrzlFn44CFnLT3n08iT
 KMNrKktGlXCg
 =2o5W
 -----END PGP SIGNATURE-----

Merge tag 'net-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth.

  Quite calm week. No new regression under investigation.

  Current release - regressions:

   - eth: revert "igb: Disable threaded IRQ for igb_msix_other"

  Current release - new code bugs:

   - bluetooth: btintel: direct exception event to bluetooth stack

  Previous releases - regressions:

   - core: fix data-races around sk->sk_forward_alloc

   - netlink: terminate outstanding dump on socket close

   - mptcp: error out earlier on disconnect

   - vsock: fix accept_queue memory leak

   - phylink: ensure PHY momentary link-fails are handled

   - eth: mlx5:
      - fix null-ptr-deref in add rule err flow
      - lock FTE when checking if active

   - eth: dwmac-mediatek: fix inverted handling of mediatek,mac-wol

  Previous releases - always broken:

   - sched: fix u32's systematic failure to free IDR entries for hnodes.

   - sctp: fix possible UAF in sctp_v6_available()

   - eth: bonding: add ns target multicast address to slave device

   - eth: mlx5: fix msix vectors to respect platform limit

   - eth: icssg-prueth: fix 1 PPS sync"

* tag 'net-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
  net: sched: u32: Add test case for systematic hnode IDR leaks
  selftests: bonding: add ns multicast group testing
  bonding: add ns target multicast address to slave device
  net: ti: icssg-prueth: Fix 1 PPS sync
  stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines
  net: Make copy_safe_from_sockptr() match documentation
  net: stmmac: dwmac-mediatek: Fix inverted handling of mediatek,mac-wol
  ipmr: Fix access to mfc_cache_list without lock held
  samples: pktgen: correct dev to DEV
  net: phylink: ensure PHY momentary link-fails are handled
  mptcp: pm: use _rcu variant under rcu_read_lock
  mptcp: hold pm lock when deleting entry
  mptcp: update local address flags when setting it
  net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes.
  MAINTAINERS: Re-add cancelled Renesas driver sections
  Revert "igb: Disable threaded IRQ for igb_msix_other"
  Bluetooth: btintel: Direct exception event to bluetooth stack
  Bluetooth: hci_core: Fix calling mgmt_device_connected
  virtio/vsock: Improve MSG_ZEROCOPY error handling
  vsock: Fix sk_error_queue memory leak
  ...
2024-11-14 10:05:33 -08:00
Jeongjun Park
35f56c554e netfilter: ipset: add missing range check in bitmap_ip_uadt
When tb[IPSET_ATTR_IP_TO] is not present but tb[IPSET_ATTR_CIDR] exists,
the values of ip and ip_to are slightly swapped. Therefore, the range check
for ip should be done later, but this part is missing and it seems that the
vulnerability occurs.

So we should add missing range checks and remove unnecessary range checks.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+58c872f7790a4d2ac951@syzkaller.appspotmail.com
Fixes: 72205fc68b ("netfilter: ipset: bitmap:ip set type support")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 13:47:26 +01:00
Florian Westphal
508180850b netfilter: nf_tables: allocate element update information dynamically
Move the timeout/expire/flag members from nft_trans_one_elem struct into
a dybamically allocated structure, only needed when timeout update was
requested.

This halves size of nft_trans_one_elem struct and allows to compact up to
124 elements in one transaction container rather than 62.

This halves memory requirements for a large flush or insert transaction,
where ->update remains NULL.

Care has to be taken to release the extra data in all spots, including
abort path.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 13:05:49 +01:00
Florian Westphal
b0c4946604 netfilter: nf_tables: switch trans_elem to real flex array
When queueing a set element add or removal operation to the transaction
log, check if the previous operation already asks for a the identical
operation on the same set.

If so, store the element reference in the preceding operation.
This significantlty reduces memory consumption when many set add/delete
operations appear in a single transaction.

Example: 10k elements require 937kb of memory (10k allocations from
kmalloc-96 slab).

Assuming we can compact 4 elements in the same set, 468 kbytes
are needed (64 bytes for base struct, nft_trans_elemn, 32 bytes
for nft_trans_one_elem structure, so 2500 allocations from kmalloc-192
slab).

For large batch updates we can compact up to 62 elements
into one single nft_trans_elem structure (~65% mem reduction):
(64 bytes for base struct, nft_trans_elem, 32 byte for nft_trans_one_elem
 struct).

We can halve size of nft_trans_one_elem struct by moving
timeout/expire/update_flags into a dynamically allocated structure,
this allows to store 124 elements in a 2k slab nft_trans_elem struct.
This is done in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 12:40:55 +01:00
Florian Westphal
466c9b3b2a netfilter: nf_tables: prepare nft audit for set element compaction
nftables audit log format emits the number of added/deleted rules, sets,
set elements and so on, to userspace:

    table=t1 family=2 entries=4 op=nft_register_set
                      ~~~~~~~~~

At this time, the 'entries' key is the number of transactions that will
be applied.

The upcoming set element compression will coalesce subsequent
adds/deletes to the same set requests in the same transaction
request to conseve memory.

Without this patch, we'd under-report the number of altered elements.

Increment the audit counter by the number of elements to keep the reported
entries value the same.

Without this, nft_audit.sh selftest fails because the recorded
(expected) entries key is smaller than the expected one.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 12:40:55 +01:00
Florian Westphal
a8ee6b900c netfilter: nf_tables: prepare for multiple elements in nft_trans_elem structure
Add helpers to release the individual elements contained in the
trans_elem container structure.

No functional change intended.

Followup patch will add 'nelems' member and will turn 'priv' into
a flexible array.

These helpers can then loop over all elements.
Care needs to be taken to handle a mix of new elements and existing
elements that are being updated (e.g. timeout refresh).

Before this patch, NEWSETELEM transaction with update is released
early so nft_trans_set_elem_destroy() won't get called, so we need
to skip elements marked as update.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 12:40:37 +01:00
Florian Westphal
4ee2918121 netfilter: nf_tables: add nft_trans_commit_list_add_elem helper
Add and use a wrapper to append trans_elem structures to the
transaction log.

Unlike the existing helper, pass a gfp_t to indicate if sleeping
is allowed.

This will be used by a followup patch to realloc nft_trans_elem
structures after they gain a flexible array member to reduce
number of such container structures on the transaction list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 12:40:08 +01:00
Simon Horman
8340b0056a netfilter: bpf: Pass string literal as format argument of request_module()
Both gcc-14 and clang-18 report that passing a non-string literal as the
format argument of request_module() is potentially insecure.

E.g. clang-18 says:

.../nf_bpf_link.c:46:24: warning: format string is not a string literal (potentially insecure) [-Wformat-security]
   46 |                 err = request_module(mod);
      |                                      ^~~
.../kmod.h:25:55: note: expanded from macro 'request_module'
   25 | #define request_module(mod...) __request_module(true, mod)
      |                                                       ^~~
.../nf_bpf_link.c:46:24: note: treat the string as an argument to avoid this
   46 |                 err = request_module(mod);
      |                                      ^
      |                                      "%s",
.../kmod.h:25:55: note: expanded from macro 'request_module'
   25 | #define request_module(mod...) __request_module(true, mod)
      |                                                       ^

It is always the case where the contents of mod is safe to pass as the
format argument. That is, in my understanding, it never contains any
format escape sequences.

But, it seems better to be safe than sorry. And, as a bonus, compiler
output becomes less verbose by addressing this issue as suggested by
clang-18.

Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 12:39:50 +01:00
Donald Hunter
3f54959628 netfilter: nfnetlink: Report extack policy errors for batched ops
The nftables batch processing does not currently populate extack with
policy errors. Fix this by passing extack when parsing batch messages.

Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-11-14 12:39:40 +01:00
Daniel Yang
9e1a6db68e xfrm: replace deprecated strncpy with strscpy_pad
The function strncpy is deprecated since it does not guarantee the
destination buffer is NULL terminated. Recommended replacement is
strscpy. The padded version was used to remain consistent with the other
strscpy_pad usage in the modified function.

Signed-off-by: Daniel Yang <danielyangkang@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-11-14 11:38:37 +01:00
Everest K.C
9d287e70c5 xfrm: Add error handling when nla_put_u32() returns an error
Error handling is missing when call to nla_put_u32() fails.
Handle the error when the call to nla_put_u32() returns an error.

The error was reported by Coverity Scan.
Report:
CID 1601525: (#1 of 1): Unused value (UNUSED_VALUE)
returned_value: Assigning value from nla_put_u32(skb, XFRMA_SA_PCPU, x->pcpu_num)
to err here, but that stored value is overwritten before it can be used

Fixes: 1ddf9916ac ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Everest K.C. <everestkc@everestkc.com.np>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-11-14 08:23:15 +01:00
Breno Leitao
e28acc9c1c ipmr: Fix access to mfc_cache_list without lock held
Accessing `mr_table->mfc_cache_list` is protected by an RCU lock. In the
following code flow, the RCU read lock is not held, causing the
following error when `RCU_PROVE` is not held. The same problem might
show up in the IPv6 code path.

	6.12.0-rc5-kbuilder-01145-gbac17284bdcb #33 Tainted: G            E    N
	-----------------------------
	net/ipv4/ipmr_base.c:313 RCU-list traversed in non-reader section!!

	rcu_scheduler_active = 2, debug_locks = 1
		   2 locks held by RetransmitAggre/3519:
		    #0: ffff88816188c6c0 (nlk_cb_mutex-ROUTE){+.+.}-{3:3}, at: __netlink_dump_start+0x8a/0x290
		    #1: ffffffff83fcf7a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_dumpit+0x6b/0x90

	stack backtrace:
		    lockdep_rcu_suspicious
		    mr_table_dump
		    ipmr_rtm_dumproute
		    rtnl_dump_all
		    rtnl_dumpit
		    netlink_dump
		    __netlink_dump_start
		    rtnetlink_rcv_msg
		    netlink_rcv_skb
		    netlink_unicast
		    netlink_sendmsg

This is not a problem per see, since the RTNL lock is held here, so, it
is safe to iterate in the list without the RCU read lock, as suggested
by Eric.

To alleviate the concern, modify the code to use
list_for_each_entry_rcu() with the RTNL-held argument.

The annotation will raise an error only if RTNL or RCU read lock are
missing during iteration, signaling a legitimate problem, otherwise it
will avoid this false positive.

This will solve the IPv6 case as well, since ip6mr_rtm_dumproute() calls
this function as well.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241108-ipmr_rcu-v2-1-c718998e209b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-13 19:09:42 -08:00
Matthieu Baerts (NGI0)
db3eab8110 mptcp: pm: use _rcu variant under rcu_read_lock
In mptcp_pm_create_subflow_or_signal_addr(), rcu_read_(un)lock() are
used as expected to iterate over the list of local addresses, but
list_for_each_entry() was used instead of list_for_each_entry_rcu() in
__lookup_addr(). It is important to use this variant which adds the
required READ_ONCE() (and diagnostic checks if enabled).

Because __lookup_addr() is also used in mptcp_pm_nl_set_flags() where it
is called under the pernet->lock and not rcu_read_lock(), an extra
condition is then passed to help the diagnostic checks making sure
either the associated spin lock or the RCU lock is held.

Fixes: 86e39e0448 ("mptcp: keep track of local endpoint still available for each msk")
Cc: stable@vger.kernel.org
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241112-net-mptcp-misc-6-12-pm-v1-3-b835580cefa8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-13 18:51:02 -08:00
Geliang Tang
f642c5c4d5 mptcp: hold pm lock when deleting entry
When traversing userspace_pm_local_addr_list and deleting an entry from
it in mptcp_pm_nl_remove_doit(), msk->pm.lock should be held.

This patch holds this lock before mptcp_userspace_pm_lookup_addr_by_id()
and releases it after list_move() in mptcp_pm_nl_remove_doit().

Fixes: d9a4594eda ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Cc: stable@vger.kernel.org
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241112-net-mptcp-misc-6-12-pm-v1-2-b835580cefa8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-13 18:51:02 -08:00
Geliang Tang
e026631941 mptcp: update local address flags when setting it
Just like in-kernel pm, when userspace pm does set_flags, it needs to send
out MP_PRIO signal, and also modify the flags of the corresponding address
entry in the local address list. This patch implements the missing logic.

Traverse all address entries on userspace_pm_local_addr_list to find the
local address entry, if bkup is true, set the flags of this entry with
FLAG_BACKUP, otherwise, clear FLAG_BACKUP.

Fixes: 892f396c8e ("mptcp: netlink: issue MP_PRIO signals from userspace PMs")
Cc: stable@vger.kernel.org
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241112-net-mptcp-misc-6-12-pm-v1-1-b835580cefa8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-13 18:51:02 -08:00
Jakub Kicinski
5c46638540 wireless-next patches for v6.13
Most likely the last -next pull request for v6.13. Most changes are in
 Realtek and Qualcomm drivers, otherwise not really anything
 noteworthy.
 
 Major changes:
 
 mac80211
 
 * EHT 1024 aggregation size for transmissions
 
 ath12k
 
 * switch to using wiphy_lock() and remove ar->conf_mutex
 
 * firmware coredump collection support
 
 * add debugfs support for a multitude of statistics
 
 ath11k
 
 * dt: document WCN6855 hardware inputs
 
 ath9k
 
 * remove include/linux/ath9k_platform.h
 
 ath5k
 
 * Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support
 
 rtw88:
 
 * 8821au and 8812au USB adapters support
 
 rtw89
 
 * thermal protection
 
 * firmware secure boot for WiFi 6 chip
 -----BEGIN PGP SIGNATURE-----
 
 iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmc04UYRHGt2YWxvQGtl
 cm5lbC5vcmcACgkQbhckVSbrbZuckgf/RV0zy8gMuzJ/cSk1GDKoOYmEwAZ4JvtW
 teAKghsODDW/bng2iKnXphJyx3spZRCNuvOmfPcHsWoResX+vqrKJOaER/3159OF
 68xAPZNXPRF4M693IpIUB/P3uTw/jieXPI7ftSPuUOhStca/ALwQd5Lp3kNKkVtq
 HipXJwCenVS7Hd8DdHbpvYFUckRWr3tHPFlOgG3qOQOVvfRen2z9rhM14oK9rn+h
 f309ATHKTbpTKNagOPYAYcyHs3zE59hlVRgRqHL7Ew0a0HI8uPJ4KK2n5W6tZJFN
 swhoQolc1uXrRYlZ3Bdr7mKSIqt557kRz7NJ9ITe7KKCU0CxM/7nhQ==
 =v8bS
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2024-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says:

====================
wireless-next patches for v6.13

Most likely the last -next pull request for v6.13. Most changes are in
Realtek and Qualcomm drivers, otherwise not really anything
noteworthy.

Major changes:

mac80211
 * EHT 1024 aggregation size for transmissions

ath12k
 * switch to using wiphy_lock() and remove ar->conf_mutex
 * firmware coredump collection support
 * add debugfs support for a multitude of statistics

ath11k
 * dt: document WCN6855 hardware inputs

ath9k
 * remove include/linux/ath9k_platform.h

ath5k
 * Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support

rtw88:
 * 8821au and 8812au USB adapters support

rtw89
 * thermal protection
 * firmware secure boot for WiFi 6 chip

* tag 'wireless-next-2024-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (154 commits)
  Revert "wifi: iwlegacy: do not skip frames with bad FCS"
  wifi: mac80211: pass MBSSID config by reference
  wifi: mac80211: Support EHT 1024 aggregation size in TX
  net: rfkill: gpio: Add check for clk_enable()
  wifi: brcmfmac: Fix oops due to NULL pointer dereference in brcmf_sdiod_sglist_rw()
  wifi: Switch back to struct platform_driver::remove()
  wifi: ipw2x00: libipw_rx_any(): fix bad alignment
  wifi: brcmfmac: release 'root' node in all execution paths
  wifi: iwlwifi: mvm: don't call power_update_mac in fast suspend
  wifi: iwlwifi: s/IWL_MVM_INVALID_STA/IWL_INVALID_STA
  wifi: iwlwifi: bump minimum API version in BZ/SC to 92
  wifi: iwlwifi: move IWL_LMAC_*_INDEX to fw/api/context.h
  wifi: iwlwifi: be less noisy if the NIC is dead in S3
  wifi: iwlwifi: mvm: tell iwlmei when we finished suspending
  wifi: iwlwifi: allow fast resume on ax200
  wifi: iwlwifi: mvm: support new initiator and responder command version
  wifi: iwlwifi: mvm: use wiphy locked debugfs for low-latency
  wifi: iwlwifi: mvm: MLO scan upon channel condition degradation
  wifi: iwlwifi: mvm: support new versions of the wowlan APIs
  wifi: iwlwifi: mvm: allow always calling iwl_mvm_get_bss_vif()
  ...
====================

Link: https://patch.msgid.link/20241113172918.A8A11C4CEC3@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-13 18:35:19 -08:00
Alexei Starovoitov
8714381703 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR.

In particular to bring the fix in
commit aa30eb3260 ("bpf: Force checkpoint when jmp history is too long").
The follow up verifier work depends on it.
And the fix in
commit 6801cf7890 ("selftests/bpf: Use -4095 as the bad address for bits iterator").
It's fixing instability of BPF CI on s390 arch.

No conflicts.

Adjacent changes in:
Auto-merging arch/Kconfig
Auto-merging kernel/bpf/helpers.c
Auto-merging kernel/bpf/memalloc.c
Auto-merging kernel/bpf/verifier.c
Auto-merging mm/slab_common.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-13 12:52:51 -08:00
Linus Torvalds
9f8e716d46 BPF fixes:
- Fix a mismatching RCU unlock flavor in bpf_out_neigh_v6
   (Jiawei Ye)
 
 - Fix BPF sockmap with kTLS to reject vsock and unix sockets
   upon kTLS context retrieval (Zijian Zhang)
 
 - Fix BPF bits iterator selftest for s390x (Hou Tao)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZzQV0BUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIPFywD9Fx9Qc7LdWGmRAmWTqGKSOVPTBC1L
 eC/uXop6sLqapP0A/1KsLQmntvXhp+gmxzPEBdwAwb7/DvyPCQV19FZ/sIkA
 =lDzI
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann:

 - Fix a mismatching RCU unlock flavor in bpf_out_neigh_v6 (Jiawei Ye)

 - Fix BPF sockmap with kTLS to reject vsock and unix sockets upon kTLS
   context retrieval (Zijian Zhang)

 - Fix BPF bits iterator selftest for s390x (Hou Tao)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix mismatched RCU unlock flavour in bpf_out_neigh_v6
  bpf: Add sk_is_inet and IS_ICSK check in tls_sw_has_ctx_tx/rx
  selftests/bpf: Use -4095 as the bad address for bits iterator
2024-11-13 09:14:19 -08:00
Jakub Kicinski
ef04d290c0 net: page_pool: do not count normal frag allocation in stats
Commit 0f6deac3a0 ("net: page_pool: add page allocation stats for
two fast page allocate path") added increments for "fast path"
allocation to page frag alloc. It mentions performance degradation
analysis but the details are unclear. Could be that the author
was simply surprised by the alloc stats not matching packet count.

In my experience the key metric for page pool is the recycling rate.
Page return stats, however, count returned _pages_ not frags.
This makes it impossible to calculate recycling rate for drivers
using the frag API. Here is example output of the page-pool
YNL sample for a driver allocating 1200B frags (4k pages)
with nearly perfect recycling:

  $ ./page-pool
    eth0[2]	page pools: 32 (zombies: 0)
		refs: 291648 bytes: 1194590208 (refs: 0 bytes: 0)
		recycling: 33.3% (alloc: 4557:2256365862 recycle: 200476245:551541893)

The recycling rate is reported as 33.3% because we give out
4096 // 1200 = 3 frags for every recycled page.

Effectively revert the aforementioned commit. This also aligns
with the stats we would see for drivers which do the fragmentation
themselves, although that's not a strong reason in itself.

On the (very unlikely) path where we can reuse the current page
let's bump the "cached" stat. The fact that we don't put the page
in the cache is just an optimization.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://patch.msgid.link/20241109023303.3366500-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12 18:26:58 -08:00
Alexandre Ferrieux
73af53d820 net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes.
To generate hnode handles (in gen_new_htid()), u32 uses IDR and
encodes the returned small integer into a structured 32-bit
word. Unfortunately, at disposal time, the needed decoding
is not done. As a result, idr_remove() fails, and the IDR
fills up. Since its size is 2048, the following script ends up
with "Filter already exists":

  tc filter add dev myve $FILTER1
  tc filter add dev myve $FILTER2
  for i in {1..2048}
  do
    echo $i
    tc filter del dev myve $FILTER2
    tc filter add dev myve $FILTER2
  done

This patch adds the missing decoding logic for handles that
deserve it.

Fixes: e7614370d6 ("net_sched: use idr to allocate u32 filter handles")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20241110172836.331319-1-alexandre.ferrieux@orange.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12 18:26:03 -08:00
Dmitry Kandybka
b169e76eba mptcp: fix possible integer overflow in mptcp_reset_tout_timer
In 'mptcp_reset_tout_timer', promote 'probe_timestamp' to unsigned long
to avoid possible integer overflow. Compile tested only.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Dmitry Kandybka <d.kandybka@gmail.com>
Link: https://patch.msgid.link/20241107103657.1560536-1-d.kandybka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12 18:11:23 -08:00
Luiz Augusto von Dentz
7967dc8f79 Bluetooth: hci_core: Fix calling mgmt_device_connected
Since 61a939c68e ("Bluetooth: Queue incoming ACL data until
BT_CONNECTED state is reached") there is no long the need to call
mgmt_device_connected as ACL data will be queued until BT_CONNECTED
state.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219458
Link: https://github.com/bluez/bluez/issues/1014
Fixes: 333b4fd11e ("Bluetooth: L2CAP: Fix uaf in l2cap_connect")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-11-12 11:39:12 -05:00
Johannes Berg
f2aadc7212 wifi: mac80211: pass MBSSID config by reference
It's inefficient and confusing to pass the MBSSID config
by value, requiring the whole struct to be copied. Pass
it by reference instead.

Link: https://patch.msgid.link/20241108092227.48fbd8a00112.I64abc1296a7557aadf798d88db931024486ab3b6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-12 13:46:55 +01:00
MeiChia Chiu
406c5548c6 wifi: mac80211: Support EHT 1024 aggregation size in TX
Support EHT 1024 aggregation size in TX

The 1024 agg size for RX is supported but not for TX.
This patch adds this support and refactors common parsing logics for
addbaext in both process_addba_resp and process_addba_req into a
function.

Reviewed-by: Shayne Chen <shayne.chen@mediatek.com>
Reviewed-by: Money Wang <money.wang@mediatek.com>
Co-developed-by: Peter Chiu <chui-hao.chiu@mediatek.com>
Signed-off-by: Peter Chiu <chui-hao.chiu@mediatek.com>
Signed-off-by: MeiChia Chiu <MeiChia.Chiu@mediatek.com>
Link: https://patch.msgid.link/20241112083846.32063-1-MeiChia.Chiu@mediatek.com
[pass elems/len instead of mgmt/len/is_req]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-12 13:41:45 +01:00
Mingwei Zheng
8251e7621b net: rfkill: gpio: Add check for clk_enable()
Add check for the return value of clk_enable() to catch the potential
error.

Fixes: 7176ba23f8 ("net: rfkill: add generic gpio rfkill driver")
Signed-off-by: Mingwei Zheng <zmw12306@gmail.com>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Link: https://patch.msgid.link/20241108195341.1853080-1-zmw12306@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-11-12 13:30:31 +01:00
Jakub Kicinski
a58f00ed24 net: sched: cls_api: improve the error message for ID allocation failure
We run into an exhaustion problem with the kernel-allocated filter IDs.
Our allocation problem can be fixed on the user space side,
but the error message in this case was quite misleading:

  "Filter with specified priority/protocol not found" (EINVAL)

Specifically when we can't allocate a _new_ ID because filter with
lowest ID already _exists_, saying "filter not found", is confusing.

Kernel allocates IDs in range of 0xc0000 -> 0x8000, giving out ID one
lower than lowest existing in that range. The error message makes sense
when tcf_chain_tp_find() gets called for GET and DEL but for NEW we
need to provide more specific error messages for all three cases:

 - user wants the ID to be auto-allocated but filter with ID 0x8000
   already exists

 - filter already exists and can be replaced, but user asked
   for a protocol change

 - filter doesn't exist

Caller of tcf_chain_tp_insert_unique() doesn't set extack today,
so don't bother plumbing it in.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20241108010254.2995438-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 12:58:31 +01:00
Michal Luczaj
60cf6206a1 virtio/vsock: Improve MSG_ZEROCOPY error handling
Add a missing kfree_skb() to prevent memory leaks.

Fixes: 581512a6dc ("vsock/virtio: MSG_ZEROCOPY flag support")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 12:16:51 +01:00
Michal Luczaj
fbf7085b3a vsock: Fix sk_error_queue memory leak
Kernel queues MSG_ZEROCOPY completion notifications on the error queue.
Where they remain, until explicitly recv()ed. To prevent memory leaks,
clean up the queue when the socket is destroyed.

unreferenced object 0xffff8881028beb00 (size 224):
  comm "vsock_test", pid 1218, jiffies 4294694897
  hex dump (first 32 bytes):
    90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff  ..!.......!.....
    00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff  ..........!.....
  backtrace (crc 6c7031ca):
    [<ffffffff81418ef7>] kmem_cache_alloc_node_noprof+0x2f7/0x370
    [<ffffffff81d35882>] __alloc_skb+0x132/0x180
    [<ffffffff81d2d32b>] sock_omalloc+0x4b/0x80
    [<ffffffff81d3a8ae>] msg_zerocopy_realloc+0x9e/0x240
    [<ffffffff81fe5cb2>] virtio_transport_send_pkt_info+0x412/0x4c0
    [<ffffffff81fe6183>] virtio_transport_stream_enqueue+0x43/0x50
    [<ffffffff81fe0813>] vsock_connectible_sendmsg+0x373/0x450
    [<ffffffff81d233d5>] ____sys_sendmsg+0x365/0x3a0
    [<ffffffff81d246f4>] ___sys_sendmsg+0x84/0xd0
    [<ffffffff81d26f47>] __sys_sendmsg+0x47/0x80
    [<ffffffff820d3df3>] do_syscall_64+0x93/0x180
    [<ffffffff8220012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 581512a6dc ("vsock/virtio: MSG_ZEROCOPY flag support")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 12:16:51 +01:00
Michal Luczaj
d7b0ff5a86 virtio/vsock: Fix accept_queue memory leak
As the final stages of socket destruction may be delayed, it is possible
that virtio_transport_recv_listen() will be called after the accept_queue
has been flushed, but before the SOCK_DONE flag has been set. As a result,
sockets enqueued after the flush would remain unremoved, leading to a
memory leak.

vsock_release
  __vsock_release
    lock
    virtio_transport_release
      virtio_transport_close
        schedule_delayed_work(close_work)
    sk_shutdown = SHUTDOWN_MASK
(!) flush accept_queue
    release
                                        virtio_transport_recv_pkt
                                          vsock_find_bound_socket
                                          lock
                                          if flag(SOCK_DONE) return
                                          virtio_transport_recv_listen
                                            child = vsock_create_connected
                                      (!)   vsock_enqueue_accept(child)
                                          release
close_work
  lock
  virtio_transport_do_close
    set_flag(SOCK_DONE)
    virtio_transport_remove_sock
      vsock_remove_sock
        vsock_remove_bound
  release

Introduce a sk_shutdown check to disallow vsock_enqueue_accept() during
socket destruction.

unreferenced object 0xffff888109e3f800 (size 2040):
  comm "kworker/5:2", pid 371, jiffies 4294940105
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    28 00 0b 40 00 00 00 00 00 00 00 00 00 00 00 00  (..@............
  backtrace (crc 9e5f4e84):
    [<ffffffff81418ff1>] kmem_cache_alloc_noprof+0x2c1/0x360
    [<ffffffff81d27aa0>] sk_prot_alloc+0x30/0x120
    [<ffffffff81d2b54c>] sk_alloc+0x2c/0x4b0
    [<ffffffff81fe049a>] __vsock_create.constprop.0+0x2a/0x310
    [<ffffffff81fe6d6c>] virtio_transport_recv_pkt+0x4dc/0x9a0
    [<ffffffff81fe745d>] vsock_loopback_work+0xfd/0x140
    [<ffffffff810fc6ac>] process_one_work+0x20c/0x570
    [<ffffffff810fce3f>] worker_thread+0x1bf/0x3a0
    [<ffffffff811070dd>] kthread+0xdd/0x110
    [<ffffffff81044fdd>] ret_from_fork+0x2d/0x50
    [<ffffffff8100785a>] ret_from_fork_asm+0x1a/0x30

Fixes: 3fe356d58e ("vsock/virtio: discard packets only when socket is really closed")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 12:16:51 +01:00
Breno Leitao
12079a59ce net: Implement fault injection forcing skb reallocation
Introduce a fault injection mechanism to force skb reallocation. The
primary goal is to catch bugs related to pointer invalidation after
potential skb reallocation.

The fault injection mechanism aims to identify scenarios where callers
retain pointers to various headers in the skb but fail to reload these
pointers after calling a function that may reallocate the data. This
type of bug can lead to memory corruption or crashes if the old,
now-invalid pointers are used.

By forcing reallocation through fault injection, we can stress-test code
paths and ensure proper pointer management after potential skb
reallocations.

Add a hook for fault injection in the following functions:

 * pskb_trim_rcsum()
 * pskb_may_pull_reason()
 * pskb_trim()

As the other fault injection mechanism, protect it under a debug Kconfig
called CONFIG_FAIL_SKB_REALLOC.

This patch was *heavily* inspired by Jakub's proposal from:
https://lore.kernel.org/all/20240719174140.47a868e6@kernel.org/

CC: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20241107-fault_v6-v6-1-1b82cb6ecacd@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 12:05:33 +01:00
Menglong Dong
479aed04e8 net: ip: make ip_route_use_hint() return drop reasons
In this commit, we make ip_route_use_hint() return drop reasons. The
drop reasons that we return are similar to what we do in
ip_route_input_slow(), and no drop reasons are added in this commit.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:51 +01:00
Menglong Dong
d9340d1e02 net: ip: make ip_mkroute_input/__mkroute_input return drop reasons
In this commit, we make ip_mkroute_input() and __mkroute_input() return
drop reasons.

The drop reason "SKB_DROP_REASON_ARP_PVLAN_DISABLE" is introduced for
the case: the packet which is not IP is forwarded to the in_dev, and
the proxy_arp_pvlan is not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:51 +01:00
Menglong Dong
50038bf38e net: ip: make ip_route_input() return drop reasons
In this commit, we make ip_route_input() return skb drop reasons that come
from ip_route_input_noref().

Meanwhile, adjust all the call to it.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:51 +01:00
Menglong Dong
82d9983ebe net: ip: make ip_route_input_noref() return drop reasons
In this commit, we make ip_route_input_noref() return drop reasons, which
come from ip_route_input_rcu().

We need adjust the callers of ip_route_input_noref() to make sure the
return value of ip_route_input_noref() is used properly.

The errno that ip_route_input_noref() returns comes from ip_route_input
and bpf_lwt_input_reroute in the origin logic, and we make them return
-EINVAL on error instead. In the following patch, we will make
ip_route_input() returns drop reasons too.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:51 +01:00
Menglong Dong
61b95c70f3 net: ip: make ip_route_input_rcu() return drop reasons
In this commit, we make ip_route_input_rcu() return drop reasons, which
come from ip_route_input_mc() and ip_route_input_slow().

The only caller of ip_route_input_rcu() is ip_route_input_noref(). We
adjust it by making it return -EINVAL on error and ignore the reasons that
ip_route_input_rcu() returns. In the following patch, we will make
ip_route_input_noref() returns the drop reasons.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:51 +01:00
Menglong Dong
5b92112acd net: ip: make ip_route_input_slow() return drop reasons
In this commit, we make ip_route_input_slow() return skb drop reasons,
and following new skb drop reasons are added:

  SKB_DROP_REASON_IP_INVALID_DEST

The only caller of ip_route_input_slow() is ip_route_input_rcu(), and we
adjust it by making it return -EINVAL on error.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:50 +01:00
Menglong Dong
d46f827016 net: ip: make ip_mc_validate_source() return drop reason
Make ip_mc_validate_source() return drop reason, and adjust the call of
it in ip_route_input_mc().

Another caller of it is ip_rcv_finish_core->udp_v4_early_demux, and the
errno is not checked in detail, so we don't do more adjustment for it.

The drop reason "SKB_DROP_REASON_IP_LOCALNET" is added in this commit.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:50 +01:00
Menglong Dong
c6c670784b net: ip: make ip_route_input_mc() return drop reason
Make ip_route_input_mc() return drop reason, and adjust the call of it
in ip_route_input_rcu().

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:50 +01:00
Menglong Dong
37653a0b8a net: ip: make fib_validate_source() support drop reasons
In this commit, we make fib_validate_source() and __fib_validate_source()
return -reason instead of errno on error.

The return value of fib_validate_source can be -errno, 0, and 1. It's hard
to make fib_validate_source() return drop reasons directly.

The fib_validate_source() will return 1 if the scope of the source(revert)
route is HOST. And the __mkroute_input() will mark the skb with
IPSKB_DOREDIRECT in this case (combine with some other conditions). And
then, a REDIRECT ICMP will be sent in ip_forward() if this flag exists. We
can't pass this information to __mkroute_input if we make
fib_validate_source() return drop reasons.

Therefore, we introduce the wrapper fib_validate_source_reason() for
fib_validate_source(), which will return the drop reasons on error.

In the origin logic, LINUX_MIB_IPRPFILTER will be counted if
fib_validate_source() return -EXDEV. And now, we need to adjust it by
checking "reason == SKB_DROP_REASON_IP_RPFILTER". However, this will take
effect only after the patch "net: ip: make ip_route_input_noref() return
drop reasons", as we can't pass the drop reasons from
fib_validate_source() to ip_rcv_finish_core() in this patch.

Following new drop reasons are added in this patch:

  SKB_DROP_REASON_IP_LOCAL_SOURCE
  SKB_DROP_REASON_IP_INVALID_SOURCE

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12 11:24:50 +01:00
Vladimir Vdovin
7d3f3b4367 net: ipv4: Cache pmtu for all packet paths if multipath enabled
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.

An example topology showing the problem:

                    |  host1
                +---------+
                |  dummy0 | 10.179.20.18/32  mtu9000
                +---------+
        +-----------+----------------+
    +---------+                     +---------+
    | ens17f0 |  10.179.2.141/31    | ens17f1 |  10.179.2.13/31
    +---------+                     +---------+
        |    (all here have mtu 9000)    |
    +------+                         +------+
    | ro1  |  10.179.2.140/31        | ro2  |  10.179.2.12/31
    +------+                         +------+
        |                                |
---------+------------+-------------------+------
                        |
                    +-----+
                    | ro3 | 10.10.10.10  mtu1500
                    +-----+
                        |
    ========================================
                some networks
    ========================================
                        |
                    +-----+
                    | eth0| 10.10.30.30  mtu9000
                    +-----+
                        |  host2

host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:

default proto static src 10.179.20.18
        nexthop via 10.179.2.12 dev ens17f1 weight 1
        nexthop via 10.179.2.140 dev ens17f0 weight 1

When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.

Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.

Host1 now have this routes to host2:

ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
    cache expires 521sec mtu 1500

ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
    cache

So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.

Signed-off-by: Vladimir Vdovin <deliran@verdict.gg>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241108093427.317942-1-deliran@verdict.gg
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 19:07:36 -08:00
Paolo Abeni
ce7356ae35 mptcp: cope racing subflow creation in mptcp_rcv_space_adjust
Additional active subflows - i.e. created by the in kernel path
manager - are included into the subflow list before starting the
3whs.

A racing recvmsg() spooling data received on an already established
subflow would unconditionally call tcp_cleanup_rbuf() on all the
current subflows, potentially hitting a divide by zero error on
the newly created ones.

Explicitly check that the subflow is in a suitable state before
invoking tcp_cleanup_rbuf().

Fixes: c76c695656 ("mptcp: call tcp_cleanup_rbuf on subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/02374660836e1b52afc91966b7535c8c5f7bafb0.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 19:06:34 -08:00
Paolo Abeni
5813022985 mptcp: error out earlier on disconnect
Eric reported a division by zero splat in the MPTCP protocol:

Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 6094 Comm: syz-executor317 Not tainted
6.12.0-rc5-syzkaller-00291-g05b92660cdfe #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
RIP: 0010:__tcp_select_window+0x5b4/0x1310 net/ipv4/tcp_output.c:3163
Code: f6 44 01 e3 89 df e8 9b 75 09 f8 44 39 f3 0f 8d 11 ff ff ff e8
0d 74 09 f8 45 89 f4 e9 04 ff ff ff e8 00 74 09 f8 44 89 f0 99 <f7> 7c
24 14 41 29 d6 45 89 f4 e9 ec fe ff ff e8 e8 73 09 f8 48 89
RSP: 0018:ffffc900041f7930 EFLAGS: 00010293
RAX: 0000000000017e67 RBX: 0000000000017e67 RCX: ffffffff8983314b
RDX: 0000000000000000 RSI: ffffffff898331b0 RDI: 0000000000000004
RBP: 00000000005d6000 R08: 0000000000000004 R09: 0000000000017e67
R10: 0000000000003e80 R11: 0000000000000000 R12: 0000000000003e80
R13: ffff888031d9b440 R14: 0000000000017e67 R15: 00000000002eb000
FS: 00007feb5d7f16c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007feb5d8adbb8 CR3: 0000000074e4c000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__tcp_cleanup_rbuf+0x3e7/0x4b0 net/ipv4/tcp.c:1493
mptcp_rcv_space_adjust net/mptcp/protocol.c:2085 [inline]
mptcp_recvmsg+0x2156/0x2600 net/mptcp/protocol.c:2289
inet_recvmsg+0x469/0x6a0 net/ipv4/af_inet.c:885
sock_recvmsg_nosec net/socket.c:1051 [inline]
sock_recvmsg+0x1b2/0x250 net/socket.c:1073
__sys_recvfrom+0x1a5/0x2e0 net/socket.c:2265
__do_sys_recvfrom net/socket.c:2283 [inline]
__se_sys_recvfrom net/socket.c:2279 [inline]
__x64_sys_recvfrom+0xe0/0x1c0 net/socket.c:2279
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7feb5d857559
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007feb5d7f1208 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00007feb5d8e1318 RCX: 00007feb5d857559
RDX: 000000800000000e RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007feb5d8e1310 R08: 0000000000000000 R09: ffffffff81000000
R10: 0000000000000100 R11: 0000000000000246 R12: 00007feb5d8e131c
R13: 00007feb5d8ae074 R14: 000000800000000e R15: 00000000fffffdef

and provided a nice reproducer.

The root cause is the current bad handling of racing disconnect.
After the blamed commit below, sk_wait_data() can return (with
error) with the underlying socket disconnected and a zero rcv_mss.

Catch the error and return without performing any additional
operations on the current socket.

Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 419ce133ab ("tcp: allow again tcp_disconnect() when threads are waiting")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/8c82ecf71662ecbc47bf390f9905de70884c9f2d.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 19:06:34 -08:00
Martin Karsten
3fcbecbdeb net: Add control functions for irq suspension
The napi_suspend_irqs routine bootstraps irq suspension by elongating
the defer timeout to irq_suspend_timeout.

The napi_resume_irqs routine effectively cancels irq suspension by
forcing the napi to be scheduled immediately.

Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 18:45:06 -08:00
Martin Karsten
5dc51ec86d net: Add napi_struct parameter irq_suspend_timeout
Add a per-NAPI IRQ suspension parameter, which can be get/set with
netdev-genl.

This patch doesn't change any behavior but prepares the code for other
changes in the following commits which use irq_suspend_timeout as a
timeout for IRQ suspension.

Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 18:45:05 -08:00
Mina Almasry
f2685c00c3 net: fix SO_DEVMEM_DONTNEED looping too long
Exit early if we're freeing more than 1024 frags, to prevent
looping too long.

Also minor code cleanups:
- Flip checks to reduce indentation.
- Use sizeof(*tokens) everywhere for consistentcy.

Cc: Yi Lai <yi1.lai@linux.intel.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107210331.3044434-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 18:11:46 -08:00
Kuniyuki Iwashima
636af13f21 rtnetlink: Register rtnl_dellink() and rtnl_setlink() with RTNL_FLAG_DOIT_PERNET_WIP.
Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted
to per-netns RTNL due to a lack of handling peer/lower/upper devices in
different netns.

For example, when we change a device in rtnl_setlink() and need to
propagate that to its upper devices, we want to avoid acquiring all netns
locks, for which we do not know the upper limit.

The same situation happens when we remove a device.

rtnl_dellink() could be transformed to remove a single device in the
requested netns and delegate other devices to per-netns work, and
rtnl_setlink() might be ?

Until we come up with a better idea, let's use a new flag
RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink().

This will unblock converting RTNL users where such devices are not related.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241108004823.29419-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima
d91191ffe2 rtnetlink: Convert RTM_NEWLINK to per-netns RTNL.
Now, we are ready to convert rtnl_newlink() to per-netns RTNL;
rtnl_link_ops is protected by SRCU and netns is prefetched in
rtnl_newlink().

Let's register rtnl_newlink() with RTNL_FLAG_DOIT_PERNET and
push RTNL down as rtnl_nets_lock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima
28690e5361 rtnetlink: Add peer_type in struct rtnl_link_ops.
In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with
a net pointer, which is the first argument of ->newlink().

rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID
and IFLA_NET_NS_FD in the peer device's attributes.

We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink()
for per-netns RTNL.

All of the three get the peer netns in the same way:

  1. Call rtnl_nla_parse_ifinfomsg()
  2. Call ops->validate() (vxcan doesn't have)
  3. Call rtnl_link_get_net_tb()

Let's add a new field peer_type to struct rtnl_link_ops and prefetch
netns in the peer ifla to add it to rtnl_nets in rtnl_newlink().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 17:26:52 -08:00
Kuniyuki Iwashima
cbaaa6326b rtnetlink: Introduce struct rtnl_nets and helpers.
rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device
and 1 for its peer.

We will add rtnl_nets_lock() later, which performs the nested locking
based on struct rtnl_nets, which has an array of struct net pointers.

rtnl_nets_add() adds a net pointer to the array and sorts it so that
rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2].

Before calling rtnl_nets_add(), get_net() must be called for the net,
and rtnl_nets_destroy() will call put_net() for each.

Let's apply the helpers to rtnl_newlink().

When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call
rtnl_net_lock() thus do not care about the array order, so
rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add()
can be optimised to NOP.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 17:26:51 -08:00
Kuniyuki Iwashima
68297dbb96 rtnetlink: Remove __rtnl_link_register()
link_ops is protected by link_ops_mutex and no longer needs RTNL,
so we have no reason to have __rtnl_link_register() separately.

Let's remove it and call rtnl_link_register() from ifb.ko and
dummy.ko.

Note that both modules' init() work on init_net only, so we need
not export pernet_ops_rwsem and can use rtnl_net_lock() there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 17:26:51 -08:00
Kuniyuki Iwashima
6b57ff21a3 rtnetlink: Protect link_ops by mutex.
rtnl_link_unregister() holds RTNL and calls synchronize_srcu(),
but rtnl_newlink() will acquire SRCU frist and then RTNL.

Then, we need to unlink ops and call synchronize_srcu() outside
of RTNL to avoid the deadlock.

   rtnl_link_unregister()       rtnl_newlink()
   ----                         ----
   lock(rtnl_mutex);
                                lock(&ops->srcu);
                                lock(rtnl_mutex);
   sync(&ops->srcu);

Let's move as such and add a mutex to protect link_ops.

Now, link_ops is protected by its dedicated mutex and
rtnl_link_register() no longer needs to hold RTNL.

While at it, we move the initialisation of ops->dellink and
ops->srcu out of the mutex scope.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 17:26:51 -08:00
Kuniyuki Iwashima
d5ec8d91f8 rtnetlink: Remove __rtnl_link_unregister().
rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(),
where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests
for per-netns RTNL.

We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko
and dummy.ko.

However, rtnl_newlink() will acquire SRCU before RTNL later in this
series.  Then, lockdep will detect the deadlock:

   rtnl_link_unregister()       rtnl_newlink()
   ----                         ----
   lock(rtnl_mutex);
                                lock(&ops->srcu);
                                lock(rtnl_mutex);
   sync(&ops->srcu);

To avoid the problem, we must call synchronize_srcu() before RTNL in
rtnl_link_unregister().

As a preparation, let's remove __rtnl_link_unregister().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11 17:26:51 -08:00