The error path is complicated in unix_dgram_sendmsg() because there
are two points at which other can become non-NULL: when it's fetched
by unix_peer_get() and when it's looked up by unix_find_other().
Let's move unix_peer_get() to the else branch for unix_find_other()
and clean up the error paths.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When other has SOCK_DEAD in unix_dgram_sendmsg(), we hold
unix_state_lock() for the sender socket first.
However, we do not need it for sk->sk_type.
Let's move the lock down a bit.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When other has SOCK_DEAD in unix_dgram_sendmsg(), we call sock_put() for
it first and then set other to NULL before jumping to the error path.
This is to skip sock_put() in the error path.
Let's stop setting other to NULL and defer the sock_put() to the error
path so that the labels can be cleaned up later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are two paths jumping to the restart label in unix_dgram_sendmsg().
One requires another lookup and sk_filter(), but the other doesn't.
Let's split the label to make each flow more straightforward.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In unix_dgram_sendmsg(), we use a local variable sunaddr pointing to
NULL or msg->msg_name based on msg->msg_namelen.
Let's remove sunaddr and simplify the usage.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When other is NULL in unix_dgram_sendmsg(), we check if sunaddr
is NULL before looking up a receiver socket.
There are three paths going through the check, but it's always
false for 2 out of the 3 paths: the first socket lookup and the
second 'goto restart'.
The condition can be true for the first 'goto restart' only when
SOCK_DEAD is flagged for the socket found with msg->msg_name.
Let's move the check to the single appropriate path.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will introduce skb drop reasons for AF_UNIX, and then we need to
set an errno and a drop reason for each path.
Let's set an error only when it's needed in unix_dgram_sendmsg().
Then, we no longer need to (re)set err to 0.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If we move send_sig() to the SEND_SHUTDOWN check before
the while loop, then we can reuse the same kfree_skb()
after the pipe_err_free label.
Let's gather the scattered kfree_skb()s in error paths.
While at it, some style issues are fixed, and the pipe_err_free
label is renamed to out_pipe to match other label names.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will introduce skb drop reasons for AF_UNIX, and then we need to
set an errno and a drop reason for each path.
Let's set an error only when it's needed in unix_stream_sendmsg().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The label order is weird in unix_stream_connect(), and all NULL checks
are unnecessary if reordered.
Let's clean up the error paths to make it easy to set a drop reason
for each path.
While at it, a comment with the old style is updated.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We will introduce skb drop reasons for AF_UNIX, and then we need to
set an errno and a drop reason for each path.
Let's set an error only when it's needed in unix_stream_connect().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the
ancillary data returned by recvmsg().
This is analogous to the existing support for SO_RCVMARK,
as implemented in commit 6fd1d51cfa ("net: SO_RCVMARK socket option
for SO_MARK with recvmsg()").
Reviewed-by: Willem de Bruijn <willemb@google.com>
Suggested-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-5-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Linux socket API currently allows setting SO_PRIORITY at the
socket level, applying a uniform priority to all packets sent through
that socket. The exception to this is IP_TOS, where the priority value
is calculated during the handling of ancillary data, as implemented in
commit f02db315b8 ("ipv4: IP_TOS and IP_TTL can be specified as
ancillary data").
However, this is a computed value, and prior to this patch there is no
mechanism to set a custom priority via control messages.
With this patch, if SO_PRIORITY is specified as ancillary data,
the packet is sent with the priority value set through
sockc->priority, overriding the socket-level values
set via the traditional setsockopt() method. This is analogous to
the existing support for SO_MARK, as implemented in
commit c6af0c227a ("ip: support SO_MARK cmsg").
If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that
takes precedence is the last one in the cmsg list.
This patch has the side effect that raw_send_hdrinc now interprets cmsg
IP_TOS.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Suggested-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-3-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simplify priority setting permissions with the 'sk_set_prio_allowed'
function, centralizing the validation logic. This change is made in
anticipation of a second caller in a following patch.
No functional changes.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com>
Link: https://patch.msgid.link/20241213084457.45120-2-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When userspace is adding data to an RPC call for transmission, it must pass
MSG_MORE to sendmsg() if it intends to add more data in future calls to
sendmsg(). Calling sendmsg() without MSG_MORE being asserted closes the
transmission phase of the call (assuming sendmsg() adds all the data
presented) and further attempts to add more data should be rejected.
However, this is no longer the case. The change of call state that was
previously the guard got bumped over to the I/O thread, which leaves a
window for a repeat sendmsg() to insert more data. This previously went
unnoticed, but the more recent patch that changed the structures behind the
Tx queue added a warning:
WARNING: CPU: 3 PID: 6639 at net/rxrpc/sendmsg.c:296 rxrpc_send_data+0x3f2/0x860
and rejected the additional data, returning error EPROTO.
Fix this by adding a guard flag to the call, setting the flag when we queue
the final packet and then rejecting further attempts to add data with
EPROTO.
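The intended behaviour can be modelled in a few lines of userspace C.
This is a simplified model under stated assumptions, not the rxrpc code:
struct model_call, its tx_closed flag and model_send() are hypothetical
names standing in for the call, the new guard flag and the sendmsg()
data path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a call's transmission phase. */
struct model_call {
	size_t queued;    /* bytes accepted so far */
	bool tx_closed;   /* guard: set once the final packet is queued */
};

/*
 * Accept len bytes for the call. 'more' plays the role of MSG_MORE:
 * when it is false, this is the final data and the phase is closed.
 */
static int model_send(struct model_call *call, size_t len, bool more)
{
	assert(call != NULL);

	if (call->tx_closed)
		return -EPROTO;   /* reject data after the final packet */

	call->queued += len;
	if (!more)
		call->tx_closed = true;
	return 0;
}
```

Because the flag is set at the point where the final packet is queued,
a repeat send is rejected regardless of when the call state itself is
updated elsewhere.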
Fixes: 2d689424b6 ("rxrpc: Move call state changes from sendmsg to I/O thread")
Reported-by: syzbot+ff11be94dfcd7a5af8da@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/6757fb68.050a0220.2477f.005f.GAE@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: syzbot+ff11be94dfcd7a5af8da@syzkaller.appspotmail.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/2870480.1734037462@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use spin_lock_irq(), not spin_lock_bh() to take the lock when accessing the
->attend_link() to stop a delay in the I/O thread due to an interrupt being
taken in the app thread whilst that holds the lock and vice versa.
Fixes: a2ea9a9072 ("rxrpc: Use irq-disabling spinlocks between app and I/O thread")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/2870146.1734037095@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce support for ETHTOOL_MSG_TSCONFIG_GET/SET ethtool netlink socket
to read and configure hwtstamp configuration of a PHC provider. Note that
simultaneous hwtstamp isn't supported; configuring a new one disables the
previous setting.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Either the MAC or the PHY can provide hwtstamp, so we should be able to
read the tsinfo for any hwtstamp provider.
Enhance 'get' command to retrieve tsinfo of hwtstamp providers within a
network topology.
Add support for a specific dump command to retrieve all hwtstamp
providers within the network topology, with added functionality for
filtered dump to target a single interface.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the description of a hwtstamp provider, mainly defined by
the hwtstamp source and the phydev pointer.
Add a hwtstamp provider description within the netdev structure to
allow saving the hwtstamp we want to use. This prepares for future
support of an ethtool netlink command to select the desired hwtstamp
provider. By default, the old API that does not support hwtstamp
selectability is used, meaning the hwtstamp provider pointer is unset.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the net_hwtstamp_validate function accessible in preparation for
using it from ethtool to validate the hwtstamp configuration before
setting it.
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the dev_get_hwtstamp_phylib function accessible in preparation for
using it from ethtool to read the current hwtstamp configuration.
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces 5 counters to keep track of key updates:
Tls{Rx,Tx}Rekey{Ok,Error} and TlsRxRekeyReceived.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the possibility to change the key and IV when using
TLS1.3. Changing the cipher or TLS version is not supported.
Once we have updated the RX key, we can unblock the receive side. If
the rekey fails, the context is unmodified and userspace is free to
retry the update or close the socket.
This change only affects tls_sw, since 1.3 offload isn't supported.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a TLS handshake record carrying a KeyUpdate message is received,
all subsequent records will be encrypted with a new key. We need to
stop decrypting incoming records with the old key, and wait until
userspace provides a new key.
Make a note of this in the RX context just after decrypting that
record, and stop recvmsg/splice calls with EKEYEXPIRED until the new
key is available.
key_update_pending can't be combined with the existing bitfield,
because we will read it locklessly in ->poll.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon successful return, mptcp_pm_parse_addr() returns 0, so there is no
need to set "err = 0" after this. After mptcp_nl_find_ssk() returns, we
just need to set "err = -ESRCH", then release and free the msk socket if
it returns NULL.
Also, there is no need to define the variable "subflow" in
subflow_destroy(); use mptcp_subflow_ctx(ssk) directly.
This patch doesn't change the behaviour of the code, just refactoring.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-7-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Generally, in the path manager interfaces, the local address is defined as
an mptcp_pm_addr_entry type address, while the remote address is defined as
an mptcp_addr_info type one:
(struct mptcp_pm_addr_entry *local, struct mptcp_addr_info *remote)
But the subflow_destroy() interface uses two mptcp_addr_info type
parameters. This patch changes the first one to mptcp_pm_addr_entry type
and uses the helper mptcp_pm_parse_entry() to parse it instead of
mptcp_pm_parse_addr().
This patch doesn't change the behaviour of the code, just refactoring.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-6-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mptcp_pm_remove_addrs() actually only deletes one address, which does
not match its name. This patch renames it to mptcp_pm_remove_addr_entry()
and changes the parameter "rm_list" to "entry".
With the help of mptcp_pm_remove_addr_entry(), it's no longer necessary to
move the entry to be deleted to free_list and then traverse the list to
delete the entry, which is not allowed in BPF. The entry can be directly
deleted through list_del_rcu() and sock_kfree_s() now.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-5-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since mptcp_pm_remove_addrs() is only called from the userspace PM, this
patch moves it into pm_userspace.c.
For this, lookup_subflow_by_saddr() and remove_anno_list_by_saddr()
helpers need to be exported in protocol.h. Also add "mptcp_" prefix for
these helpers.
Here, mptcp_pm_remove_addrs() is not changed to a static function because
it will be used in BPF Path Manager.
This patch doesn't change the behaviour of the code, just refactoring.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-4-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each userspace pm netlink function uses nla_get_u32() to get the msk
token value, passes it to mptcp_token_get_sock() to get the msk, and
finally checks whether the userspace PM is selected on this msk. It
makes sense to wrap these steps into a helper, named
mptcp_userspace_pm_get_sock().
This patch doesn't change the behaviour of the code, just refactoring.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-3-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to mptcp_for_each_subflow() macro, this patch adds a new macro
mptcp_for_each_userspace_pm_addr() for userspace PM to iterate over the
address entries on the local address list userspace_pm_local_addr_list
of the mptcp socket.
This patch doesn't change the behaviour of the code, just refactoring.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-2-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Like __lookup_addr() helper in pm_netlink.c, a new helper
mptcp_userspace_pm_lookup_addr() is also defined in pm_userspace.c.
It looks up the corresponding mptcp_pm_addr_entry address in
userspace_pm_local_addr_list through the passed "addr" parameter
and returns the found address entry.
This helper can be used in mptcp_userspace_pm_delete_local_addr(),
mptcp_userspace_pm_set_flags(), mptcp_userspace_pm_get_local_id()
and mptcp_userspace_pm_is_backup() to simplify the code.
Please note that with this change list_for_each_entry() is now used in
mptcp_userspace_pm_append_new_local_addr(), not list_for_each_entry_safe().
This is fine because mptcp_userspace_pm_lookup_addr() only returns an
entry from the list; the list is not modified here.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241213-net-next-mptcp-pm-misc-cleanup-v1-1-ddb6d00109a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adding a route metric greater than 0x7fff_ffff leads to an
unintended wrap when printing the underlying u32 as a
signed int (`%d`), thus incorrectly rendering the metric
as negative. Formatting using `%u` corrects the issue.
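The effect is easy to reproduce outside the kernel. The sketch below is
illustrative only: format_metric_signed() and format_metric_unsigned()
are hypothetical helpers, and the explicit cast to int stands in for
what a `%d` conversion effectively does with a 32-bit value on common
compilers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format the metric the old way: interpreted as a signed int. */
static int format_metric_signed(char *buf, size_t len, uint32_t metric)
{
	assert(buf != NULL);
	/* The conversion is implementation-defined; gcc and clang wrap. */
	return snprintf(buf, len, "%d", (int)metric);
}

/* Format the metric the fixed way: as an unsigned value. */
static int format_metric_unsigned(char *buf, size_t len, uint32_t metric)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%u", (unsigned int)metric);
}
```

For a metric of 0x80000000, the first helper produces a string with a
leading minus sign while the second produces the expected positive
value.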
Signed-off-by: Maximilian Güntner <code@mguentner.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241212161911.51598-1-code@mguentner.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This change introduces netlink notifications for multicast address
changes. The following features are included:
* Addition and deletion of multicast addresses are reported using
RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and
AF_INET6.
* Two new notification groups: RTNLGRP_IPV4_MCADDR and
RTNLGRP_IPV6_MCADDR are introduced for receiving these events.
This change allows user space applications (e.g., ip monitor) to
efficiently track multicast group memberships by listening for netlink
events. Previously, applications relied on inefficient polling of
procfs, introducing delays. With netlink notifications, applications
receive realtime updates on multicast group membership changes,
enabling more precise metrics collection and system monitoring.
This change also unlocks the potential for implementing a wide range
of sophisticated multicast related features in user space by allowing
applications to combine kernel provided multicast address information
with user space data and communicate decisions back to the kernel for
more fine grained control. This mechanism can be used for various
purposes, including multicast filtering, IGMP/MLD offload, and
IGMP/MLD snooping.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dmabuf dma-addresses should not be dma_sync'd for CPU/device. Typically
it's the driver's responsibility to dma_sync for CPU, but the driver should
not dma_sync for CPU if the netmem is actually coming from a dmabuf
memory provider.
The page_pool already exposes a helper for dma_sync_for_cpu:
page_pool_dma_sync_for_cpu. Upgrade this existing helper to handle
netmem, and have it skip dma_sync if the memory is from a dmabuf memory
provider. Drivers should migrate to using this helper when adding
support for netmem.
Also minimize the impact on the dma syncing performance for pages. Special
case the dma-sync path for pages to not go through the overhead checks
for dma-syncing and conversion to netmem.
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241211212033.1684197-5-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the `dma_map` and `dma_sync` checks to `page_pool_init` to make
them generic. Set dma_sync to false for devmem memory provider because
the dma_sync APIs should not be used for dma_buf backed devmem memory
provider.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20241211212033.1684197-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
page_pool_alloc_netmem (without an s) was the mirror of
page_pool_alloc_pages (with an s), which was confusing.
Rename to page_pool_alloc_netmems so it's the mirror of
page_pool_alloc_pages.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241211212033.1684197-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, __xdp_return() takes a pointer to the virtual memory to free
a buffer. Apart from the fact that this sometimes provokes redundant
data <--> page conversions, taking a data pointer effectively prevents
lots of XDP code from supporting non-page-backed buffers, as there's no
mapping for the non-host memory (data is always NULL).
Just convert it to always take netmem reference. For
xdp_return_{buff,frame*}(), this chops off one page_address() per each
frag and adds one virt_to_netmem() (same as virt_to_page()) per header
buffer. For __xdp_return() itself, it removes one virt_to_page() for
MEM_TYPE_PAGE_POOL and another one for MEM_TYPE_PAGE_ORDER0, adding
one page_address() for [not really common nowadays]
MEM_TYPE_PAGE_SHARED, but the main effect is that the abovementioned
functions won't die or memleak anymore if the frame has non-host memory
attached and will correctly free those.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Initially, xdp_frame::mem.id was used to search for the corresponding
&page_pool to return the page correctly.
However, after struct page was extended to have a direct pointer
to its PP (netmem has it as well), further keeping of this field makes
no sense. xdp_return_frame_bulk() still used it to do a lookup, and
this leftover is now removed.
Remove xdp_frame::mem and replace it with ::mem_type, as only memory
type still matters and we need to know it to be able to free the frame
correctly.
As a cute side effect, we can now make every scalar field in &xdp_frame
of 4 byte width, speeding up accesses to them.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The main reason for this change was to allow mixing pages from different
&page_pools within one &xdp_buff/&xdp_frame. Why not? With stuff like
devmem and io_uring zerocopy Rx, it's required to have separate PPs for
header buffers and payload buffers.
Adjust xdp_return_frame_bulk() and page_pool_put_netmem_bulk(), so that
they won't be tied to a particular pool. Let the latter create a
separate bulk of pages whose PP is different from the first netmem of
the bulk and process it after the main loop.
This greatly optimizes xdp_return_frame_bulk(): no more hashtable
lookups and forced flushes on PP mismatch. Also make
xdp_flush_frame_bulk() inline, as it's just one if + function call + one
u32 read, not worth extending the call ladder.
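The iterative "while (count)" idea can be sketched in plain C as
follows. This is a simplified model, not the page_pool code: struct buf,
its pool_id and tag fields, and release_bulk() are hypothetical
stand-ins for a netmem, its owning page_pool and the bulk-put routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer and the pool that owns it. */
struct buf {
	int pool_id;   /* identifies the owning pool */
	int tag;       /* identifies the buffer itself */
};

/*
 * Release count buffers that may belong to different pools.
 * Each pass handles every buffer owned by the same pool as data[0] and
 * compacts the "foreign" ones to the front of the array for the next
 * pass. released[] receives the tags in release order.
 * Returns the number of passes, i.e. the number of distinct pools seen.
 */
static size_t release_bulk(struct buf *data, size_t count, int *released)
{
	size_t passes = 0, out = 0;

	assert(data != NULL && released != NULL);

	while (count) {
		int pool = data[0].pool_id;
		size_t foreign = 0;

		for (size_t i = 0; i < count; i++) {
			if (data[i].pool_id != pool) {
				/* Defer: foreign <= i, so nothing unread is lost. */
				data[foreign++] = data[i];
				continue;
			}
			released[out++] = data[i].tag;
		}
		count = foreign;
		passes++;
	}
	return passes;
}
```

Each pass releases at least data[0], so the loop always terminates, and
no lookup structure is needed to group the buffers by pool.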
Co-developed-by: Toke Høiland-Jørgensen <toke@redhat.com> # iterative
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org> # while (count)
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add three qdisc-specific drop reasons and use them in sch_cake:
1) SKB_DROP_REASON_QDISC_OVERLIMIT
Whenever the total queue limit for a qdisc instance is exceeded
and a packet is dropped to make room.
2) SKB_DROP_REASON_QDISC_CONGESTED
Whenever a packet is dropped by the qdisc AQM algorithm because
congestion is detected.
3) SKB_DROP_REASON_CAKE_FLOOD
Whenever a packet is dropped by the flood protection part of the
CAKE AQM algorithm (BLUE).
Also use the existing SKB_DROP_REASON_QUEUE_PURGE in cake_clear_tin().
Reasons show up as:
perf record -a -e skb:kfree_skb sleep 1; perf script
iperf3 665 [005] 848.656964: skb:kfree_skb: skbaddr=0xffff98168a333500 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x10f0 reason: QDISC_OVERLIMIT
swapper 0 [001] 909.166055: skb:kfree_skb: skbaddr=0xffff98168280cee0 rx_sk=(nil) protocol=34525 location=cake_dequeue+0x5ef reason: QDISC_CONGESTED
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241211-cake-drop-reason-v2-1-920afadf4d1b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, netfilter and wireless.
Current release - fix to a fix:
- rtnetlink: fix error code in rtnl_newlink()
- tipc: fix NULL deref in cleanup_bearer()
Current release - regressions:
- ip: fix warning about invalid return from ip_route_input_rcu()
Current release - new code bugs:
- udp: fix L4 hash after reconnect
- eth: lan969x: fix cyclic dependency between modules
- eth: bnxt_en: fix potential crash when dumping FW log coredump
Previous releases - regressions:
- wifi: mac80211:
- fix a queue stall in certain cases of channel switch
- wake the queues in case of failure in resume
- splice: do not checksum AF_UNIX sockets
- virtio_net: fix BUG()s in BQL support due to incorrect accounting
of purged packets during interface stop
- eth:
- stmmac: fix TSO DMA API mis-usage causing oops
- bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops
due to incorrect aggregation ID mask on 5760X chips
Previous releases - always broken:
- Bluetooth: improve setsockopt() handling of malformed user input
- eth: ocelot: fix PTP timestamping in presence of packet loss
- ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not
supported"
* tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
net: dsa: tag_ocelot_8021q: fix broken reception
net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
net: renesas: rswitch: fix initial MPIC register setting
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
team: Fix initial vlan_feature set in __team_compute_features
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
net, team, bonding: Add netdev_base_features helper
net/sched: netem: account for backlog updates from child qdisc
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
splice: do not checksum AF_UNIX sockets
net: usb: qmi_wwan: add Telit FE910C04 compositions
...
Merge tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- SCO: Fix transparent voice setting
- ISO: Locking fixes
- hci_core: Fix sleeping function called from invalid context
- hci_event: Fix using rcu_read_(un)lock while iterating
- btmtk: avoid UAF in btmtk_process_coredump
* tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
Bluetooth: Improve setsockopt() handling of malformed user input
====================
Link: https://patch.msgid.link/20241212142806.2046274-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The blamed commit changed the dsa_8021q_rcv() calling convention to
accept pre-populated source_port and switch_id arguments. If those are
not available, as in the case of tag_ocelot_8021q, the arguments must be
pre-initialized with -1.
Due to the bug of passing uninitialized arguments in tag_ocelot_8021q,
dsa_8021q_rcv() does not detect that it needs to populate the
source_port and switch_id, and this makes dsa_conduit_find_user() fail,
which leads to packet loss on reception.
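The convention described above can be sketched in user-space C as follows.
This is a simplified model, not the kernel code: demo_tag, demo_8021q_rcv()
and demo_rcv_port() are hypothetical names, and the only behaviour taken
from the commit message is that -1 means "not known, please populate".

```c
#include <assert.h>

/* Hypothetical decoded VLAN tag contents (illustration only). */
struct demo_tag {
	int source_port;
	int switch_id;
};

/* Simplified stand-in for the receive helper: it fills in a value only
 * when the caller passed -1, and otherwise keeps the caller's value. */
static void demo_8021q_rcv(const struct demo_tag *tag,
			   int *source_port, int *switch_id)
{
	if (*source_port == -1)
		*source_port = tag->source_port;
	if (*switch_id == -1)
		*switch_id = tag->switch_id;
}

/* A caller with no precise information of its own must pass -1 for both
 * arguments. Leaving them uninitialized would let stack garbage be
 * mistaken for pre-populated values. */
static int demo_rcv_port(const struct demo_tag *tag)
{
	int source_port = -1, switch_id = -1;

	demo_8021q_rcv(tag, &source_port, &switch_id);
	assert(source_port == tag->source_port);
	return source_port;
}
```

With a tag carrying source_port 3, demo_rcv_port() returns 3 because the
helper sees the -1 sentinel and populates the value from the tag.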
Fixes: dcfe767378 ("net: dsa: tag_sja1105: absorb logic for not overwriting precise info into dsa_8021q_rcv()")
Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241211144741.1415758-1-robert.hodaszi@digi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The voice setting is used by sco_connect() or sco_conn_defer_accept()
after being set by sco_sock_setsockopt().
The PCM part of the voice setting is used for offload mode through the
PCM chipset port.
This commit adds support for mSBC 16-bit offloading, i.e. audio data
not transported over HCI.
The BCM4349B1 supports 16-bit transparent data on its I2S port.
If BT_VOICE_TRANSPARENT is used when accepting a SCO connection, the
result is only garbage audio, while BT_VOICE_TRANSPARENT_16BIT gives
correct audio.
This has been tested with connection to iPhone 14 and Samsung S24.
Fixes: ad10b1a487 ("Bluetooth: Add Bluetooth socket voice option")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This updates iso_sock_accept() to use nested locking for the parent
socket, to avoid the lockdep warning triggered when the parent and
child sockets are locked by the same thread:
[ 41.585683] ============================================
[ 41.585688] WARNING: possible recursive locking detected
[ 41.585694] 6.12.0-rc6+ #22 Not tainted
[ 41.585701] --------------------------------------------
[ 41.585705] iso-tester/3139 is trying to acquire lock:
[ 41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH)
at: bt_accept_dequeue+0xe3/0x280 [bluetooth]
[ 41.585905]
but task is already holding lock:
[ 41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
at: iso_sock_accept+0x61/0x2d0 [bluetooth]
[ 41.586064]
other info that might help us debug this:
[ 41.586069] Possible unsafe locking scenario:
[ 41.586072] CPU0
[ 41.586076] ----
[ 41.586079] lock(sk_lock-AF_BLUETOOTH);
[ 41.586086] lock(sk_lock-AF_BLUETOOTH);
[ 41.586093]
*** DEADLOCK ***
[ 41.586097] May be due to missing lock nesting notation
[ 41.586101] 1 lock held by iso-tester/3139:
[ 41.586107] #0: ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
at: iso_sock_accept+0x61/0x2d0 [bluetooth]
Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Since hci_get_route() takes a reference on the device before returning,
the hdev must be released with hci_dev_put() at the end of
iso_listen_bis(), even if the function returns with an error.
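The release-on-every-path pattern can be sketched in user-space C as
below. It is an illustration under stated assumptions, not the
iso_listen_bis() code: demo_dev, demo_get_route(), demo_dev_put() and
demo_listen() are hypothetical stand-ins, and the reference count is a
plain integer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device with a plain integer reference count. */
struct demo_dev {
	int refcnt;
};

/* Stand-in for a lookup that returns the device with a reference held. */
static struct demo_dev *demo_get_route(struct demo_dev *dev)
{
	if (dev)
		dev->refcnt++;
	return dev;
}

static void demo_dev_put(struct demo_dev *dev)
{
	dev->refcnt--;
}

/* Every exit taken after a successful lookup goes through the put label,
 * so the reference is dropped on success and on error alike. */
static int demo_listen(struct demo_dev *dev, int fail)
{
	struct demo_dev *hdev;
	int err = 0;

	hdev = demo_get_route(dev);
	if (!hdev)
		return -EHOSTUNREACH;	/* no reference was taken */

	if (fail) {
		err = -EINVAL;
		goto put;		/* error path still releases hdev */
	}

	/* ... normal work would go here ... */

put:
	demo_dev_put(hdev);
	return err;
}
```

After either a failing or a succeeding call, the reference count is back
at its starting value, which is the property the fix restores.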
Fixes: 02171da6e8 ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Calling rcu_read_(un)lock while inside list_for_each_entry_rcu is not
safe, since for the most part entries fetched this way must be treated
like values returned by rcu_dereference:
    Note that the value returned by rcu_dereference() is valid
    only within the enclosing RCU read-side critical section [1]_.
    For example, the following is **not** legal::

        rcu_read_lock();
        p = rcu_dereference(head.next);
        rcu_read_unlock();
        x = p->address; /* BUG!!! */
        rcu_read_lock();
        y = p->data;    /* BUG!!! */
        rcu_read_unlock();
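For contrast with the illegal sequence quoted above, a legal counterpart
keeps every use of the pointer inside one read-side critical section and
carries only plain copies out of it. The sketch below is a user-space
model, not kernel RCU: the demo_rcu_* helpers are hypothetical and only
track nesting depth so that a dereference outside the section trips an
assertion.

```c
#include <assert.h>

struct demo_node {
	int address;
	int data;
};

/* Greater than zero while inside a simulated read-side section. */
static int demo_rcu_depth;

static void demo_rcu_read_lock(void)
{
	demo_rcu_depth++;
}

static void demo_rcu_read_unlock(void)
{
	assert(demo_rcu_depth > 0);
	demo_rcu_depth--;
}

/* Simulated rcu_dereference(): only valid inside the section. */
static struct demo_node *demo_rcu_dereference(struct demo_node **pp)
{
	assert(demo_rcu_depth > 0);
	return *pp;
}

static int demo_read_sum(struct demo_node **head)
{
	struct demo_node *p;
	int x, y;

	demo_rcu_read_lock();
	p = demo_rcu_dereference(head);
	x = p->address;		/* both reads happen before the unlock */
	y = p->data;
	demo_rcu_read_unlock();

	/* p is not used past this point; x and y are plain copies. */
	return x + y;
}
```

The design point is that the pointer never outlives the section in which
it was fetched; anything needed later is copied out first.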
Fixes: a0bfde167b ("Bluetooth: ISO: Add support for connecting multiple BISes")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>