The bpf_redirect_map() helper used by XDP programs doesn't return any
indication of whether it can successfully redirect to the map index it was
given. Instead, BPF programs have to track this themselves, leading to
programs using duplicate maps to track which entries are populated in the
devmap.
This patch fixes this by moving the map lookup into the bpf_redirect_map()
helper, which makes it possible to return failure to the eBPF program. The
lower bits of the flags argument is used as the return code, which means
that existing users who pass a '0' flag argument will get XDP_ABORTED.
With this, a BPF program can check the return code from the helper call and
react by, for instance, substituting a different redirect. This works for
any type of map used for redirect.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The bpf_redirect_info struct has an 'ifindex' member which was named back
when the redirects could only target egress interfaces. Now that we can
also redirect to sockets and CPUs, this is a bit misleading, so rename the
member to tgt_index.
Reorder the struct members so we can have 'tgt_index' and 'tgt_value' next
to each other in a subsequent patch.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The socket map uses a linked list instead of a bitmap to keep track of
which entries to flush. Do the same for devmap and cpumap, as this means we
don't have to care about the map index when enqueueing things into the
map (and so we can cache the map lookup).
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
passing them down to the kernel or bypass kernel completely.
BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
kernel returns.
Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
The buffer memory is pre-allocated (because I don't think there is
a precedent for working with __user memory from bpf). This might be
slow to do for each {s,g}etsockopt call, that's why I've added
__cgroup_bpf_prog_array_is_empty that exits early if there is nothing
attached to a cgroup. Note, however, that there is a race between
__cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
program layout might have changed; this should not be a problem
because in general there is a race between multiple calls to
{s,g}etsocktop and user adding/removing bpf progs from a cgroup.
The return code of the BPF program is handled as follows:
* 0: EPERM
* 1: success, continue with next BPF program in the cgroup chain
v9:
* allow overwriting setsockopt arguments (Alexei Starovoitov):
* use set_fs (same as kernel_setsockopt)
* buffer is always kzalloc'd (no small on-stack buffer)
v8:
* use s32 for optlen (Andrii Nakryiko)
v7:
* return only 0 or 1 (Alexei Starovoitov)
* always run all progs (Alexei Starovoitov)
* use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
(decided to use optval=-1 instead, optval=0 might be a valid input)
* call getsockopt hook after kernel handlers (Alexei Starovoitov)
v6:
* rework cgroup chaining; stop as soon as bpf program returns
0 or 2; see patch with the documentation for the details
* drop Andrii's and Martin's Acked-by (not sure they are comfortable
with the new state of things)
v5:
* skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
v4:
* don't export bpf_sk_fullsock helper (Martin Lau)
* size != sizeof(__u64) for uapi pointers (Martin Lau)
* offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
v3:
* typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
* reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
Nakryiko)
* use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
* use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
* new CG_SOCKOPT_ACCESS macro to wrap repeated parts
v2:
* moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
* aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
* bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
* added [0,2] return code check to verifier (Martin Lau)
* dropped unused buf[64] from the stack (Martin Lau)
* use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
* dropped bpf_target_off from ctx rewrites (Martin Lau)
* use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some drivers want to access the data transmitted in order to implement
acceleration features of the NICs. It is also useful in AF_XDP TX flow.
Change the xsk_umem_consume_tx API to return the whole xdp_desc, that
contains the data pointer, length and DMA address, instead of only the
latter two. Adapt the implementation of i40e and ixgbe to this change.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Make it possible for the application to determine whether the AF_XDP
socket is running in zero-copy mode. To achieve this, add a new
getsockopt option XDP_OPTIONS that returns flags. The only flag
supported for now is the zero-copy mode indicator.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add a function that checks whether the Fill Ring has the specified
amount of descriptors available. It will be useful for mlx5e that wants
to check in advance, whether it can allocate a bulk of RX descriptors,
to get the best performance.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fix sparse warning:
net/core/xdp.c:88:6: warning:
symbol '__mem_id_disconnect' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For tx path, in most cases, we still have to take refcnt on the dst
cause the caller is caching the dst somewhere. But it still is
beneficial to make use of RT6_LOOKUP_F_DST_NOREF flag while doing the
route lookup. It is cause this flag prevents manipulating refcnt on
net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each
routing table. The null_entry is a shared object and constant updates on
it cause false sharing.
We converted the current major lookup function ip6_route_output_flags()
to make use of RT6_LOOKUP_F_DST_NOREF.
Together with the change in the rx path, we see noticable performance
boost:
I ran synflood tests between 2 hosts under the same switch. Both hosts
have 20G mlx NIC, and 8 tx/rx queues.
Sender sends pure SYN flood with random src IPs and ports using trafgen.
Receiver has a simple TCP listener on the target port.
Both hosts have multiple custom rules:
- For incoming packets, only local table is traversed.
- For outgoing packets, 3 tables are traversed to find the route.
The packet processing rate on the receiver is as follows:
- Before the fix: 3.78Mpps
- After the fix: 5.50Mpps
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_input() is the key function to do the route lookup in the
rx data path. All the callers to this function are already holding rcu
lock. So it is fairly easy to convert it to not take refcnt on the dst:
We pass in flag RT6_LOOKUP_F_DST_NOREF and do skb_dst_set_noref().
This saves a few atomic inc or dec operations and should boost
performance overall.
This also makes the logic more aligned with v4.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch specifically converts the rule lookup logic to honor this
flag and not release refcnt when traversing each rule and calling
lookup() on each routing table.
Similar to previous patch, we also need some special handling of dst
entries in uncached list because there is always 1 refcnt taken for them
even if RT6_LOOKUP_F_DST_NOREF flag is set.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize rt6->rt6i_uncached on the following pre-allocated dsts:
net->ipv6.ip6_null_entry
net->ipv6.ip6_prohibit_entry
net->ipv6.ip6_blk_hole_entry
This is a preparation patch for later commits to be able to distinguish
dst entries in uncached list by doing:
!list_empty(rt6->rt6i_uncached)
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new flag is to instruct the route lookup function to not take
refcnt on the dst entry. The user which does route lookup with this flag
must properly use rcu protection.
ip6_pol_route() is the major route lookup function for both tx and rx
path.
In this function:
Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and
directly return the route entry. The caller should be holding rcu lock
when using this flag, and decide whether to take refcnt or not.
One note on the dst cache in the uncached_list:
As uncached_list does not consume refcnt, one refcnt is always returned
back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set.
Uncached dst is only possible in the output path. So in such call path,
caller MUST check if the dst is in the uncached_list before assuming
that there is no refcnt taken on the returned dst.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ops has been iterated to first element when call pre_exit, and
it needs to restore from save_ops, not save ops to save_ops
Fixes: d7d99872c144 ("netns: add pre_exit method to struct pernet_operations")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some changes to the TCP fastopen code to make it more robust
against future changes in the choice of key/cookie size, etc.
- Instead of keeping the SipHash key in an untyped u8[] buffer
and casting it to the right type upon use, use the correct
type directly. This ensures that the key will appear at the
correct alignment if we ever change the way these data
structures are allocated. (Currently, they are only allocated
via kmalloc so they always appear at the correct alignment)
- Use DIV_ROUND_UP when sizing the u64[] array to hold the
cookie, so it is always of sufficient size, even if
TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8.
- Drop the 'len' parameter from the tcp_fastopen_reset_cipher()
function, which is no longer used.
- Add endian swabbing when setting the keys and calculating the hash,
to ensure that cookie values are the same for a given key and
source/destination address pair regardless of the endianness of
the server.
Note that none of these are functional changes wrt the current
state of the code, with the exception of the swabbing, which only
affects big endian systems.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume
Nault.
2) Don't access the DDM interface unless the transceiver implements it
in bnx2x, from Mauro S. M. Rodrigues.
3) Don't double fetch 'len' from userspace in sock_getsockopt(), from
JingYi Hou.
4) Sign extension overflow in lio_core, from Colin Ian King.
5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski.
6) Fix epollout hang in hvsock, from Sunil Muthuswamy.
7) Fix regression in default fib6_type, from David Ahern.
8) Handle memory limits in tcp_fragment more appropriately, from Eric
Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits)
tcp: refine memory limit test in tcp_fragment()
inet: clear num_timeout reqsk_alloc()
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
net/af_iucv: build proper skbs for HiperTransport
net/af_iucv: remove GFP_DMA restriction for HiperTransport
net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge()
hvsock: fix epollout hang from race condition
net/udp_gso: Allow TX timestamp with UDP GSO
net: netem: fix use after free and double free with packet corruption
net: netem: fix backlog accounting for corrupted GSO frames
net: lio_core: fix potential sign-extension overflow on large shift
tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb
ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL
ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL
tun: wake up waitqueues after IFF_UP is set
net: remove duplicate fetch in sock_getsockopt
tipc: fix issues with early FAILOVER_MSG from peer
...
tcp_fragment() might be called for skbs in the write queue.
Memory limits might have been exceeded because tcp_sendmsg() only
checks limits at full skb (64KB) boundaries.
Therefore, we need to make sure tcp_fragment() wont punish applications
that might have setup very low SO_SNDBUF values.
Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bugfixes:
- SUNRPC: Fix a credential refcount leak
- Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
- SUNRPC: Fix xps refcount imbalance on the error path
- NFS4: Only set creation opendata if O_CREAT
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl0NMHQACgkQ18tUv7Cl
QOujAw//bL4p0ADvCSglqO0ceZcHpHuQVhj10f+tyXfkq3R7WDQmP2qok/+6uZjk
rYxWh6nCqQqTBSmstf4ouLMjdQTk+zDQf38zSSWGPW0U6rCX4kl/yOyskBriYqnf
W4U5+hOCvQ8prKdkcjbhYqdvz6qT9bhc5X7kHgtD66CtvyzUDraYw04Mojzodl91
CPV97rDZbiHBAgZPBFKF+qoTXEL7hQlHUREcR/DZBLV3qrMBsraog1T1OpWRGev0
OAxVAwZyXFcWDFm7mFcItMA4WcUnZbL77gPNtvZSfgYPxHsytHfH8KOmEG7zP5Yy
+blko41nvNR2UVOPQ/zTvWj9pkuHQlacDUrlYdgnOmHuGhufj1/xx4Z5C2dkTTnp
ufjcVXJOqZE7lWyA4IOWapc7gLM3q6I8sUR9w5nLc1meYphNAqHZgK0fTgfbMoo7
JweUBcgGmNvkMpkX561HBe16ENKvcgQQp666VsnTTI/BPZ/BBdlayKGBbxA3j22F
znF4gwznvQ0jVxtlNsibzSwM8GQbc6UM5fGPM+atlPJEzban0waQaIbCW347DViZ
fTXP2NQmvH1x+YNgx6RTRwVBWWT02u/ijQHf7+NvwWLWTKb9OXwQVjlnWHu/E/Qi
MpNXxtBTT6y+DG09VNV9CYwmAsULnlRQWe9RNBfMz+4sjLCyIjg=
=+z30
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.2-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull more NFS client fixes from Anna Schumaker:
"These are mostly refcounting issues that people have found recently.
The revert fixes a suspend recovery performance issue.
- SUNRPC: Fix a credential refcount leak
- Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
- SUNRPC: Fix xps refcount imbalance on the error path
- NFS4: Only set creation opendata if O_CREAT"
* tag 'nfs-for-5.2-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Fix a credential refcount leak
Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
net :sunrpc :clnt :Fix xps refcount imbalance on the error path
NFS4: Only set creation opendata if O_CREAT
All callers of __rpc_clone_client() pass in a value for args->cred,
meaning that the credential gets assigned and referenced in
the call to rpc_new_client().
Reported-by: Ido Schimmel <idosch@idosch.org>
Fixes: 79caa5fad47c ("SUNRPC: Cache cred of process creating the rpc_client")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Jon Hunter reports:
"I have been noticing intermittent failures with a system suspend test on
some of our machines that have a NFS mounted root file-system. Bisecting
this issue points to your commit 431235818bc3 ("SUNRPC: Declare RPC
timers as TIMER_DEFERRABLE") and reverting this on top of v5.2-rc3 does
appear to resolve the problem.
The cause of the suspend failure appears to be a long delay observed
sometimes when resuming from suspend, and this is causing our test to
timeout."
This reverts commit 431235818bc3a919ca7487500c67c3144feece80.
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpc_clnt_add_xprt take a reference to struct rpc_xprt_switch, but forget
to release it before return, may lead to a memory leak.
Signed-off-by: Lin Yi <teroincn@163.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Another round of SPDX updates for 5.2-rc6
Here is what I am guessing is going to be the last "big" SPDX update for
5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that
were "easy" to determine by pattern matching. The ones after this are
going to be a bit more difficult and the people on the spdx list will be
discussing them on a case-by-case basis now.
Another 5000+ files are fixed up, so our overall totals are:
Files checked: 64545
Files with SPDX: 45529
Compared to the 5.1 kernel which was:
Files checked: 63848
Files with SPDX: 22576
This is a huge improvement.
Also, we deleted another 20000 lines of boilerplate license crud, always
nice to see in a diffstat.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv
kbsjBGEJpLbMRB2krnaU
=RMcT
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull still more SPDX updates from Greg KH:
"Another round of SPDX updates for 5.2-rc6
Here is what I am guessing is going to be the last "big" SPDX update
for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
that were "easy" to determine by pattern matching. The ones after this
are going to be a bit more difficult and the people on the spdx list
will be discussing them on a case-by-case basis now.
Another 5000+ files are fixed up, so our overall totals are:
Files checked: 64545
Files with SPDX: 45529
Compared to the 5.1 kernel which was:
Files checked: 63848
Files with SPDX: 22576
This is a huge improvement.
Also, we deleted another 20000 lines of boilerplate license crud,
always nice to see in a diffstat"
* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
...
Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-06-19
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin.
2) BTF based map definition, from Andrii.
3) support bpf_map_lookup_elem for xskmap, from Jonathan.
4) bounded loops and scalar precision logic in the verifier, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
empty_child_inc/dec() use the ternary operator for conditional
operations. The conditions involve the post/pre in/decrement
operator and the operation is only performed when the condition
is *not* true. This is hard to parse for humans, use a regular
'if' construct instead and perform the in/decrement separately.
This also fixes two warnings that are emitted about the value
of the ternary expression being unused, when building the kernel
with clang + "kbuild: Remove unnecessary -Wno-unused-value"
(https://lore.kernel.org/patchwork/patch/1089869/):
CC net/ipv4/fib_trie.o
net/ipv4/fib_trie.c:351:2: error: expression result unused [-Werror,-Wunused-value]
++tn_info(n)->empty_children ? : ++tn_info(n)->full_children;
Fixes: 95f60ea3e99a ("fib_trie: Add collapse() and should_collapse() to resize")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A user reported that routes are getting installed with type 0 (RTN_UNSPEC)
where before the routes were RTN_UNICAST. One example is from accel-ppp
which apparently still uses the ioctl interface and does not set
rtmsg_type. Another is the netlink interface where ipv6 does not require
rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.
Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even when running as VM guest (ie pr_iucv != NULL), af_iucv can still
open HiperTransport-based connections. For robust operation these
connections require the af_iucv_netdev_notifier, so register it
unconditionally.
Also handle any error that register_netdevice_notifier() returns.
Fixes: 9fbd87d41392 ("af_iucv: handle netdev events")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The HiperSockets-based transport path in af_iucv is still too closely
entangled with qeth.
With commit a647a02512ca ("s390/qeth: speed-up L3 IQD xmit"), the
relevant xmit code in qeth has begun to use skb_cow_head(). So to avoid
unnecessary skb head expansions, af_iucv must learn to
1) respect dev->needed_headroom when allocating skbs, and
2) drop the header reference before cloning the skb.
While at it, also stop hard-coding the LL-header creation stage and just
use the appropriate helper.
Fixes: a647a02512ca ("s390/qeth: speed-up L3 IQD xmit")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
af_iucv sockets over z/VM IUCV require that their skbs are allocated
in DMA memory. This restriction doesn't apply to connections over
HiperSockets. So only set this limit for z/VM IUCV sockets, thereby
increasing the likelihood that the large (and linear!) allocations for
HiperTransport messages succeed.
Fixes: 3881ac441f64 ("af_iucv: add HiperSockets transport")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current struct pernet_operations exit() handlers are highly
discouraged to call synchronize_rcu().
There are cases where we need them, and exit_batch() does
not help the common case where a single netns is dismantled.
This patch leverages the existing synchronize_rcu() call
in cleanup_net()
Calling optional ->pre_exit() method before ->exit() or
->exit_batch() allows to benefit from a single synchronize_rcu()
call.
Note that the synchronize_rcu() calls added in this patch
are only in error paths or slow paths.
Tested:
$ time for i in {1..1000}; do unshare -n /bin/false;done
real 0m2.612s
user 0m0.171s
sys 0m2.216s
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For DMA mapping use-case the page_pool keeps a pointer
to the struct device, which is used in DMA map/unmap calls.
For our in-flight handling, we also need to make sure that
the struct device have not disappeared. This is assured
via using get_device/put_device API.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xdp tracepoints for mem id disconnect don't carry information about, why
it was not safe_to_remove. The tracepoint page_pool:page_pool_inflight in
this patch can be used for extract this info for further debugging.
This patchset also adds tracepoint for the pages_state_* release/hold
transitions, including a pointer to the page. This can be used for stats
about in-flight pages, or used to debug page leakage via keeping track of
page pointer and combining this with kprobe for __put_page().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These tracepoints make it easier to troubleshoot XDP mem id disconnect.
The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
placed at the last stable place for the pointer to struct xdp_mem_allocator,
just before it's scheduled for RCU removal. It also extract info on
'safe_to_remove' and 'force'.
Detailed info about in-flight pages is not available at this layer. The next
patch will added tracepoints needed at the page_pool layer for this.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If bugs exists or are introduced later e.g. by drivers misusing the API,
then we want to warn about the issue, such that developer notice. This patch
will generate a bit of noise in form of periodic pr_warn every 30 seconds.
It is not nice to have this stall warning running forever. Thus, this patch
will (after 120 attempts) force disconnect the mem id (from the rhashtable)
and free the page_pool object. This will cause fallback to the put_page() as
before, which only potentially leak DMA-mappings, if objects are really
stuck for this long. In that unlikely case, a WARN_ONCE should show us the
call stack.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is needed before we can allow drivers to use page_pool for
DMA-mappings. Today with page_pool and XDP return API, it is possible to
remove the page_pool object (from rhashtable), while there are still
in-flight packet-pages. This is safely handled via RCU and failed lookups in
__xdp_return() fallback to call put_page(), when page_pool object is gone.
In-case page is still DMA mapped, this will result in page note getting
correctly DMA unmapped.
To solve this, the page_pool is extended with tracking in-flight pages. And
XDP disconnect system queries page_pool and waits, via workqueue, for all
in-flight pages to be returned.
To avoid killing performance when tracking in-flight pages, the implement
use two (unsigned) counters, that in placed on different cache-lines, and
can be used to deduct in-flight packets. This is done by mapping the
unsigned "sequence" counters onto signed Two's complement arithmetic
operations. This is e.g. used by kernel's time_after macros, described in
kernel commit 1ba3aab3033b and 5a581b367b5, and also explained in RFC1982.
The trick is these two incrementing counters only need to be read and
compared, when checking if it's safe to free the page_pool structure. Which
will only happen when driver have disconnected RX/alloc side. Thus, on a
non-fast-path.
It is chosen that page_pool tracking is also enabled for the non-DMA
use-case, as this can be used for statistics later.
After this patch, using page_pool requires more strict resource "release",
e.g. via page_pool_release_page() that was introduced in this patchset, and
previous patches implement/fix this more strict requirement.
Drivers no-longer call page_pool_destroy(). Drivers already call
xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will
attempt to disconnect the mem id, and if attempt fails schedule the
disconnect for later via delayed workqueue.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case driver fails to register the page_pool with XDP return API (via
xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool
resources more directly than calling page_pool_destroy(), which does a
unnecessarily RCU free procedure.
This patch is preparing for removing page_pool_destroy(), from driver
invocation.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When converting an xdp_frame into an SKB, and sending this into the network
stack, then the underlying XDP memory model need to release associated
resources, because the network stack don't have callbacks for XDP memory
models. The only memory model that needs this is page_pool, when a driver
use the DMA-mapping feature.
Introduce page_pool_release_page(), which basically does the same as
page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to
save the function call overhead for others. Have cpumap call
xdp_release_frame() before xdp_scrub_frame().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix error handling case, where inserting ID with rhashtable_insert_slow
fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA
ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus
ida_simple_remove() is never called.
Fix by releasing ID via ida_simple_remove(), and mark xdp_rxq->mem.id with
zero, which is already checked in xdp_rxq_info_unreg_mem_model().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On a previous patch dma addr was stored in 'struct page'.
Use that to unmap DMA addresses used by network drivers
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on 1 normalized pattern(s):
gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 58 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081207.556988620@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation see readme and copying for
more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 9 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081207.060259192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this source code is licensed under general public license version 2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 5 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081204.871734026@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this work is licensed under the terms of the gnu gpl version 2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 48 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license this
program is distributed in the hope that it will be useful but
without any warranty without even the implied warranty of
merchantability or fitness for a particular purpose see the gnu
general public license for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 53 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.904365654@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not see http www gnu org
licenses
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 503 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.811534538@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 2 normalized pattern(s):
this source code is licensed under the gnu general public license
version 2 see the file copying for more details
this source code is licensed under general public license version 2
see
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 52 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>