Clean up the fib6_null_entry handling in ip6_pol_route_lookup.
rt6_device_match can return fib6_null_entry, but fib6_multipath_select
can not. Consolidate the fib6_null_entry handling and on the final
null_entry check set rt and goto out - no need to defer to a second
check after rt6_find_cached_rt.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
find_rr_leaf has 3 loops over fib_entries calling find_match. The loops
are very similar with differences in start point and whether the metric
is evaluated:
1. start at rr_head, no extra loop compare, check fib metric
2. start at leaf, compare rt against rr_head, check metric
3. start at cont (potential saved point from earlier loops), no
extra loop compare, no metric check
Create 1 loop that is called 3 different times. This will make a
later change with multipath nexthop objects much simpler.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
find_match primarily needs a fib6_nh (and fib6_flags which it passes
through to rt6_score_route). Move fib6_check_expired up to the call
sites so find_match is only called for relevant entries. Remove the
match argument which is mostly a pass through and use the return
boolean to decide if match gets set in the call sites.
The end result is a helper that can be called per fib6_nh struct
which is needed once fib entries reference nexthop objects that
have more than one fib6_nh.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_score_route only needs the fib6_flags and nexthop data. Change
it accordingly. Allows re-use later for nexthop based fib6_nh.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_probe sends probes for gateways in a nexthop. As such it really
depends on a fib6_nh, not a fib entry. Move last_probe to fib6_nh and
update rt6_probe to a fib6_nh struct.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_check_dev is a simpler helper with only 1 caller. Fold the code
into rt6_score_route.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change rt6_check_neigh to take a fib6_nh instead of a fib entry.
Move the check on fib_flags and whether the nexthop has a gateway
up to the one caller.
Remove the inline from the definition as well. Not necessary.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to have separate modules for this.
before:
text data bss dec filename
2038 1168 0 3206 net/ipv4/netfilter/ipt_MASQUERADE.ko
1526 1024 0 2550 net/ipv6/netfilter/ip6t_MASQUERADE.ko
after:
text data bss dec filename
2521 1296 0 3817 net/netfilter/xt_MASQUERADE.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Both are now implemented by nf_nat_masquerade.c, so no need to keep
different headers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
erspan_v6 tunnels run __iptunnel_pull_header on received skbs to remove
erspan header. This can determine a possible use-after-free accessing
pkt_md pointer in ip6erspan_rcv since the packet will be 'uncloned'
running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
been sent though a veth device). Fix it resetting pkt_md pointer after
__iptunnel_pull_header
Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
entry will exist in the v6 ndisc table and the cached header will contain
the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
use the v6 neighbor entry, neigh_output needs to skip the cached header
and just use the output callback for the neigh entry.
A future patchset can look at expanding the hh_cache to handle 2
protocols. For now, IPv6 gateways with an IPv4 route will take the
extra overhead of generating the header.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the gateway in a fib_nh_common to be from a different address
family than the outer fib{6}_nh. To that end, replace nhc_has_gw with
nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family.
Now nhc_family is used to know if the nh_common is part of a fib_nh
or fib6_nh (used for container_of to get to route family specific data),
and nhc_gw_family represents the address family for the gateway.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add fib6_nh_init and fib6_nh_release to ipv6_stubs. If fib6_nh_init fails,
callers should not invoke fib6_nh_release, so there is no reason to have
a dummy stub for the IPv6 is not enabled case.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
very little code, so it really doesn't make sense to have extra
modules or even a kconfig knob for this.
Merge them and make functionality available unconditionally.
The merge makes inet family route support trivial, so add it
as well here.
Before:
text data bss dec hex filename
835 832 0 1667 683 nft_chain_route_ipv4.ko
870 832 0 1702 6a6 nft_chain_route_ipv6.ko
111568 2556 529 114653 1bfdd nf_tables.ko
After:
text data bss dec hex filename
113133 2556 529 116218 1c5fa nf_tables.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After commit a297569fe00a ("net/udp: do not touch skb->peeked unless
really needed") the 'peeked' argument of __skb_try_recv_datagram()
and friends is always equal to !!'flags & MSG_PEEK'.
Since such argument is really a boolean info, and the callers have
already 'flags & MSG_PEEK' handy, we can remove it and clean-up the
code a bit.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This structure is now only 4 bytes, so its more efficient
to cache a copy rather than its address.
No significant size difference in allmodconfig vmlinux.
With non-modular kernel that has all XFRM options enabled, this
series reduces vmlinux image size by ~11kb. All xfrm_mode
indirections are gone and all modes are built-in.
before (ipsec-next master):
text data bss dec filename
21071494 7233140 11104324 39408958 vmlinux.master
after this series:
21066448 7226772 11104324 39397544 vmlinux.patched
With allmodconfig kernel, the size increase is only 362 bytes,
even all the xfrm config options removed in this series are
modular.
before:
text data bss dec filename
15731286 6936912 4046908 26715106 vmlinux.master
after this series:
15731492 6937068 4046908 26715468 vmlinux
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
after previous changes, xfrm_mode contains no function pointers anymore
and all modules defining such struct contain no code except an init/exit
functions to register the xfrm_mode struct with the xfrm core.
Just place the xfrm modes core and remove the modules,
the run-time xfrm_mode register/unregister functionality is removed.
Before:
text data bss dec filename
7523 200 2364 10087 net/xfrm/xfrm_input.o
40003 628 440 41071 net/xfrm/xfrm_state.o
15730338 6937080 4046908 26714326 vmlinux
7389 200 2364 9953 net/xfrm/xfrm_input.o
40574 656 440 41670 net/xfrm/xfrm_state.o
15730084 6937068 4046908 26714060 vmlinux
The xfrm*_mode_{transport,tunnel,beet} modules are gone.
v2: replace CONFIG_INET6_XFRM_MODE_* IS_ENABLED guards with CONFIG_IPV6
ones rather than removing them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Adds an EXPORT_SYMBOL for afinfo_get_rcu, as it will now be called from
ipv6 in case of CONFIG_IPV6=m.
This change has virtually no effect on vmlinux size, but it reduces
afinfo size and allows followup patch to make xfrm modes const.
v2: mark if (afinfo) tests as likely (Sabrina)
re-fetch afinfo according to inner_mode in xfrm_prepare_input().
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
similar to previous patch: no external module dependencies,
so we can avoid the indirection by placing this in the core.
This change removes the last indirection from xfrm_mode and the
xfrm4|6_mode_{beet,tunnel}.c modules contain (almost) no code anymore.
Before:
text data bss dec hex filename
3957 136 0 4093 ffd net/xfrm/xfrm_output.o
587 44 0 631 277 net/ipv4/xfrm4_mode_beet.o
649 32 0 681 2a9 net/ipv4/xfrm4_mode_tunnel.o
625 44 0 669 29d net/ipv6/xfrm6_mode_beet.o
599 32 0 631 277 net/ipv6/xfrm6_mode_tunnel.o
After:
text data bss dec hex filename
5359 184 0 5543 15a7 net/xfrm/xfrm_output.o
171 24 0 195 c3 net/ipv4/xfrm4_mode_beet.o
171 24 0 195 c3 net/ipv4/xfrm4_mode_tunnel.o
172 24 0 196 c4 net/ipv6/xfrm6_mode_beet.o
172 24 0 196 c4 net/ipv6/xfrm6_mode_tunnel.o
v2: fold the *encap_add functions into xfrm*_prepare_output
preserve (move) output2 comment (Sabrina)
use x->outer_mode->encap, not inner
fix a build breakage on ppc (kbuild robot)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No external dependencies on any module, place this in the core.
Increase is about 1800 byte for xfrm_input.o.
The beet helpers get added to internal header, as they can be reused
from xfrm_output.c in the next patch (kernel contains several
copies of them in the xfrm{4,6}_mode_beet.c files).
Before:
text data bss dec filename
5578 176 2364 8118 net/xfrm/xfrm_input.o
1180 64 0 1244 net/ipv4/xfrm4_mode_beet.o
171 40 0 211 net/ipv4/xfrm4_mode_transport.o
1163 40 0 1203 net/ipv4/xfrm4_mode_tunnel.o
1083 52 0 1135 net/ipv6/xfrm6_mode_beet.o
172 40 0 212 net/ipv6/xfrm6_mode_ro.o
172 40 0 212 net/ipv6/xfrm6_mode_transport.o
1056 40 0 1096 net/ipv6/xfrm6_mode_tunnel.o
After:
text data bss dec filename
7373 200 2364 9937 net/xfrm/xfrm_input.o
587 44 0 631 net/ipv4/xfrm4_mode_beet.o
171 32 0 203 net/ipv4/xfrm4_mode_transport.o
649 32 0 681 net/ipv4/xfrm4_mode_tunnel.o
625 44 0 669 net/ipv6/xfrm6_mode_beet.o
172 32 0 204 net/ipv6/xfrm6_mode_ro.o
172 32 0 204 net/ipv6/xfrm6_mode_transport.o
599 32 0 631 net/ipv6/xfrm6_mode_tunnel.o
v2: pass inner_mode to xfrm_inner_mode_encap_remove to fix
AF_UNSPEC selector breakage (bisected by Benedict Wong)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There are only two versions (tunnel and transport). The ip/ipv6 versions
are only differ in sizeof(iphdr) vs ipv6hdr.
Place this in the core and use x->outer_mode->encap type to call the
correct adjustment helper.
Before:
text data bss dec filename
15730311 6937008 4046908 26714227 vmlinux
After:
15730428 6937008 4046908 26714344 vmlinux
(about 117 byte increase)
v2: use family from x->outer_mode, not inner
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Same is input indirection. Only exception: we need to export
xfrm_outer_mode_output for pktgen.
Increases size of vmlinux by about 163 byte:
Before:
text data bss dec filename
15730208 6936948 4046908 26714064 vmlinux
After:
15730311 6937008 4046908 26714227 vmlinux
xfrm_inner_extract_output has no more external callers, make it static.
v2: add IS_ENABLED(IPV6) guard in xfrm6_prepare_output
add two missing breaks in xfrm_outer_mode_output (Sabrina Dubroca)
add WARN_ON_ONCE for 'call AF_INET6 related output function, but
CONFIG_IPV6=n' case.
make xfrm_inner_extract_output static
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
No need for any indirection or abstraction here, both functions
are pretty much the same and quite small, they also have no external
dependencies.
xfrm_prepare_input can then be made static.
With allmodconfig build, size increase of vmlinux is 25 byte:
Before:
text data bss dec filename
15730207 6936924 4046908 26714039 vmlinux
After:
15730208 6936948 4046908 26714064 vmlinux
v2: Fix INET_XFRM_MODE_TRANSPORT name in is-enabled test (Sabrina Dubroca)
change copied comment to refer to transport and network header,
not skb->{h,nh}, which don't exist anymore. (Sabrina)
make xfrm_prepare_input static (Eyal Birger)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Now that we have the family available directly in the
xfrm_mode struct, we can use that and avoid one extra dereference.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This will be useful to know if we're supposed to decode ipv4 or ipv6.
While at it, make the unregister function return void, all module_exit
functions did just BUG(); there is never a point in doing error checks
if there is no way to handle such error.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.
The benefits of a bit spin_lock are:
- no need to allocate a separate array of locks.
- no need to have a configuration option to guide the
choice of the size of this array
- locking cost is often a single test-and-set in a cache line
that will have to be loaded anyway. When inserting at, or removing
from, the head of the chain, the unlock is free - writing the new
address in the bucket head implicitly clears the lock bit.
For __rhashtable_insert_fast() we ensure this always happens
when adding a new key.
- even when lockings costs 2 updates (lock and unlock), they are
in a cacheline that needs to be read anyway.
The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.
Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved. This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.
As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.
To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit
that is stored in the bucket-table. This is "struct rhash_lock_head"
and is empty. A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".
Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head. As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor comment merge conflict in mlx5.
Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Export fib_nexthop_info and fib_add_nexthop for use by IPv6 code.
Remove rt6_nexthop_info and rt6_add_nexthop in favor of the IPv4
versions. Update fib_nexthop_info for IPv6 linkdown check and
RTA_GATEWAY for AF_INET6.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the beginning of ip6_fragment func, the prevhdr pointer is
obtained in the ip6_find_1stfragopt func.
However, all the pointers pointing into skb header may change
when calling skb_checksum_help func with
skb->ip_summed = CHECKSUM_PARTIAL condition.
The prevhdr pointe will be dangling if it is not reloaded after
calling __skb_linearize func in skb_checksum_help func.
Here, I add a variable, nexthdr_offset, to evaluate the offset,
which does not changes even after calling __skb_linearize func.
Fixes: 405c92f7a541 ("ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wenhao Zhang <zhangwenhao8@huawei.com>
Reported-by: syzbot+e8ce541d095e486074fc@syzkaller.appspotmail.com
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The device type for ip6 tunnels is set to
ARPHRD_TUNNEL6. However, the ip4ip6_err function
is expecting the device type of the tunnel to be
ARPHRD_TUNNEL. Since the device types do not
match, the function exits and the ICMP error
packet is not sent to the originating host. Note
that the device type for IPv4 tunnels is set to
ARPHRD_TUNNEL.
Fix is to expect a tunnel device type of
ARPHRD_TUNNEL6 instead. Now the tunnel device
type matches and the ICMP error packet is sent
to the originating host.
Signed-off-by: Sheena Mira-ato <sheena.mira-ato@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that struct ipv6_pinfo is not located as we think.
inet6_sk_generic() and tcp_inet6_sk() disagree on 32bit kernels by 4-bytes,
because struct tcp_sock has 8-bytes alignment,
but ipv6_pinfo size is not a multiple of 8.
sizeof(struct ipv6_pinfo): 116 (not padded to 8)
I actually first coded tcp_inet6_sk() as this patch does, but thought
that "container_of(tcp_sk(sk), struct tcp6_sock, tcp)" was cleaner.
As Julian told me : Nobody should use tcp6_sock.inet6
directly, it should be accessed via tcp_inet6_sk() or inet6_sk().
This happened when we added the first u64 field in struct tcp_sock.
Fixes: 93a77c11ae79 ("tcp: add tcp_inet6_sk() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of stubs is growing and has nothing to do with addrconf.
Move the definition of the stubs to a separate header file and update
users. In the move, drop the vxlan specific comment before ipv6_stub.
Code move only; no functional change intended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With fib_nh_common in place, move common initialization and release
code into helpers used by both ipv4 and ipv6. For the moment, the init
is just the lwt encap and the release is both the netdev reference and
the the lwt state reference. More will be added later.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add fib_nh_common struct with common nexthop attributes. Convert
fib_nh and fib6_nh to use it. Use macros to move existing
fib_nh_* references to the new nh_common.nhc_*.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename fib6_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, gateway, flags, and lwtstate are common
with all nexthop definitions. In some places new temporary variables
are declared or local variables renamed to maintain line lengths.
Rename only; no functional change intended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_add_nexthop and rt6_nexthop_info only need the fib6_info for the
gateway flag and the nexthop weight, and the presence of a gateway is now
per-nexthop. Update the signatures to take a fib6_nh and nexthop weight
and better align with the ipv4 versions.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib6_ignore_linkdown takes a fib6_info but only looks at the net_device
and its IPv6 config. Change it to take a net_device over a fib6_info as
its input argument.
In addition, move it to a header file to make the check inline and usable
later with IPv4 code without going through the ipv6 stub, and rename to
ip6_ignore_linkdown since it is only checking the setting based on the
ipv6 struct on a device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new
fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to
the new flag. For IPv6 address the flag is cheaper than checking that
nh_gw is non-0 like IPv4 does.
While this increases fib6_nh by 8-bytes, the effective allocation size of
a fib6_info is unchanged. The 8 bytes is recovered later with a
fib_nh_common change.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the fib6_nh cleanup code to a new helper, fib6_nh_release.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to IPv4, consolidate the fib6_nh initialization into a helper.
As a new standalone function, add a cleanup path to put lwtstate on
error.
To avoid modifying fib6_config flags, move the reject check to a helper
that is invoked once by fib6_nh_init to reset the device and then
again in ip6_route_info_create to set the fib6_flags.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ila_xlat_nl_cmd_flush uses rhashtable walkers allocated from the
stack but it never frees them. This corrupts the walker list of
the hash table.
This patch fixes it.
Reported-by: syzbot+dae72a112334aa65a159@syzkaller.appspotmail.com
Fixes: b6e71bdebb12 ("ila: Flush netlink command to clear xlat...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.
Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.
It is time to switch to siphash and its 128bit keys.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 6a53b7593233 ("xfrm: check id proto in validate_tmpl()")
I introduced a check for xfrm protocol, but according to Herbert
IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so
it should be removed from validate_tmpl().
And, IPSEC_PROTO_ANY is expected to only match 3 IPSec-specific
protocols, this is why xfrm_state_flush() could still miss
IPPROTO_ROUTING, which leads that those entries are left in
net->xfrm.state_all before exit net. Fix this by replacing
IPSEC_PROTO_ANY with zero.
This patch also extracts the check from validate_tmpl() to
xfrm_id_proto_valid() and uses it in parse_ipsecrequest().
With this, no other protocols should be added into xfrm.
Fixes: 6a53b7593233 ("xfrm: check id proto in validate_tmpl()")
Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.
This means the incoming skb had to be allocated on a cpu,
but freed on another.
This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.
More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.
This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.
(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
instead of 8 Mpps)
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :
- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.
Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since maxattr is common, the policy can't really differ sanely,
so make it common as well.
The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.
This reduces the size of e.g. nl80211.o (which has lots of commands):
text data bss dec hex filename
398745 14323 2240 415308 6564c net/wireless/nl80211.o (before)
397913 14331 2240 414484 65314 net/wireless/nl80211.o (after)
--------------------------------
-832 +8 0 -824
Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.
Most of the code transformations were done using the following spatch:
@ops@
identifier OPS;
expression POLICY;
@@
struct genl_ops OPS[] = {
...,
{
- .policy = POLICY,
},
...
};
@@
identifier ops.OPS;
expression ops.POLICY;
identifier fam;
expression M;
@@
struct genl_family fam = {
.ops = OPS,
.maxattr = M,
+ .policy = POLICY,
...
};
This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
the cb->data as ops, which we want to change in a later genl patch.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net and null_fallback are redundant. Remove null_fallback in favor of
!net check.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>