Because of differences in how ipv4 and ipv6 handle fib lookups,
verification of nexthops with onlink flag need to default to the main
table rather than the local table used by IPv4. As it stands an
address within a connected route on device 1 can be used with
onlink on device 2. Updating the table properly rejects the route
due to the egress device mismatch.
Update the extack message as well to show it could be a device
mismatch for the nexthop spec.
Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Verification of nexthops with onlink flag need to handle unreachable
routes. The lookup is only intended to validate the gateway address
is not a local address and if the gateway resolves the egress device
must match the given device. Hence, hitting any default reject route
is ok.
Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AF_RXRPC is incorrectly sending back to the server any abort it receives
for a client connection. This is due to the final-ACK offload to the
connection event processor patch. The abort code is copied into the
last-call information on the connection channel and then the event
processor is set.
Instead, the following should be done:
(1) In the case of a final-ACK for a successful call, the ACK should be
scheduled as before.
(2) In the case of a locally generated ABORT, the ABORT details should be
cached for sending in response to further packets related to that
call and no further action scheduled at call disconnect time.
(3) In the case of an ACK received from the peer, the call should be
considered dead, no ABORT should be transmitted at this time. In
response to further non-ABORT packets from the peer relating to this
call, an RX_USER_ABORT ABORT should be transmitted.
(4) In the case of a call killed due to network error, an RX_USER_ABORT
ABORT should be cached for transmission in response to further
packets, but no ABORT should be sent at this time.
Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client call")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for you net tree, they
are:
1) Restore __GFP_NORETRY in xt_table allocations to mitigate effects of
large memory allocation requests, from Michal Hocko.
2) Release IPv6 fragment queue in case of error in fragmentation header,
this is a follow up to amend patch 83f1999caeb1, from Subash Abhinov
Kasiviswanathan.
3) Flowtable infrastructure depends on NETFILTER_INGRESS as it registers
a hook for each flowtable, reported by John Crispin.
4) Missing initialization of info->priv in xt_cgroup version 1, from
Cong Wang.
5) Give a chance to garbage collector to run after scheduling flowtable
cleanup.
6) Releasing flowtable content on nft_flow_offload module removal is
not required at all, there is not dependencies between this module
and flowtables, remove it.
7) Fix missing xt_rateest_mutex grabbing for hash insertions, also from
Cong Wang.
8) Move nf_flow_table_cleanup() routine to flowtable core, this patch is
a dependency for the next patch in this list.
9) Flowtable resources are not properly released on removal from the
control plane. Fix this resource leak by scheduling removal of all
entries and explicit call to the garbage collector.
10) nf_ct_nat_offset() declaration is dead code, this function prototype
is not used anywhere, remove it. From Taehee Yoo.
11) Fix another flowtable resource leak on entry insertion failures,
this patch also fixes a possible use-after-free. Patch from Felix
Fietkau.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The response to a write_space notification is very latency sensitive,
so we should queue it to the lower latency xprtiod_workqueue. This
is something we already do for the other cases where an rpc task
holds the transport XPRT_LOCKED bitlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
flow_offload_del frees the flow, so all associated resource must be
freed before.
Since the ct entry in struct flow_offload_entry was allocated by
flow_offload_alloc, it should be freed by flow_offload_free to take care
of the error handling path when flow_offload_add fails.
While at it, make flow_offload_del static, since it should never be
called directly, only from the gc step
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The DEFINE_SHOW_ATTRIBUTE() helper macro would be useful for current
users, which are many of them, and for new comers to decrease code
duplication.
Acked-by: Lee Jones <lee.jones@linaro.org>
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
This macro is only used by net/ipv6/mcast.c, but there is no reason
why it must be BUILD_BUG_ON_NULL().
Replace it with BUILD_BUG_ON_ZERO(), and remove BUILD_BUG_ON_NULL()
definition from <linux/build_bug.h>.
Link: http://lkml.kernel.org/r/1515121833-3174-3-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Ian Abbott <abbotti@mev.co.uk>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
with bitmap_{from,to}_arr32 over the kernel. Additionally to it:
* __check_eq_bitmap() now takes single nbits argument.
* __check_eq_u32_array is not used in new test but may be used in
future. So I don't remove it here, but annotate as __used.
Tested on arm64 and 32-bit BE mips.
[arnd@arndb.de: perf: arm_dsu_pmu: convert to bitmap_from_arr32]
Link: http://lkml.kernel.org/r/20180201172508.5739-2-ynorov@caviumnetworks.com
[ynorov@caviumnetworks.com: fix net/core/ethtool.c]
Link: http://lkml.kernel.org/r/20180205071747.4ekxtsbgxkj5b2fz@yury-thinkpad
Link: http://lkml.kernel.org/r/20171228150019.27953-2-ynorov@caviumnetworks.com
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: David Decotigny <decot@googlers.com>,
Cc: David S. Miller <davem@davemloft.net>,
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every flow_offload entry is added into the table twice. Because of this,
rhashtable_free_and_destroy can't be used, since it would call kfree for
each flow_offload object twice.
This patch cleans up the flowtable via nf_flow_table_iterate() to
schedule removal of entries by setting on the dying bit, then there is
an explicitly invocation of the garbage collector to release resources.
Based on patch from Felix Fietkau.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move the flowtable cleanup routines to nf_flow_table and expose the
nf_flow_table_cleanup() helper function.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
rateest_hash is supposed to be protected by xt_rateest_mutex,
and, as suggested by Eric, lookup and insert should be atomic,
so we should acquire the xt_rateest_mutex once for both.
So introduce a non-locking helper for internal use and keep the
locking one for external.
Reported-by: <syzbot+5cb189720978275e4c75@syzkaller.appspotmail.com>
Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Most places in the kernel that we need to distinguish functions by the
type of their arguments, we use '_ul' as a suffix for the unsigned long
variant, not '_ext'. Also add kernel-doc.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
No real benefit to this classifier, but since we're allocating a u32
anyway, we should use this function.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Commit e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles)
converted htid allocation to use the IDR. The ID allocated by this
scheme changes; it used to be cyclic, but now always allocates the
lowest available. The IDR supports cyclic allocation, so just use
the right function.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Use the new helper. Also untangle the error path, and in so doing
noticed that estimator generator failure would lead to us leaking an
ID. Fix that bug.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Changing idr_replace's 'id' argument to 'unsigned long' works for all
callers. Callers which passed a negative ID now get -ENOENT instead of
-EINVAL. No callers relied on this error value.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlp6D5MACgkQ18tUv7Cl
QOt2IhAA3lY89PxSWhQRpQbxCAvnCATL2IjjT8z3v8eLJrNrkJ0X7mOUIP47dQHb
cvF/XNfemQssAqd9lWuKCRzx9vFbBrUmavzGyYXQBBULGZr6d9ARapIJU0pUJpv3
Mf/tGonj8FJSLT84dbbYpuU5xRUCS/WM53N0mYPUGOfWoOgL2vQQOHqgUDPLRjyx
T+2IHzCuyU0fp/QpSRCyg9OvScY9BOp0P0MER1UTH5+PGI5ApzBSd19Go3/gupj2
H8kuHsI800EWJbUynDFwDwsh+YDwDIs+WVkfEkD27LHhqT7L0W7RusV9xSzguoaJ
CK3g8L4GsqIKASQQCgqRJ1hcUbaHTHgqOvejIH8CBlH4YWMXanzT2aRh1AdDwERY
07n1zsJSyFZ4YMDRg786snWK/b3udeGv/eKS5/jEMxhNNNaTjPyceQHt3ZTNw/KO
/TYSvxYJWvaw/e4UoqcL7wb2rrr78uGTdLRDUcbtNllYk6hla4bUQkDp4D/rdoWk
vQoNpEvMty7r2+GosKgfZru/bOfZpAi8AXqyGPZPjz4LC/woYaro3HhvA4BLRRfO
NDsCSdejLO5iIssOvKfz99nxqD9VGkYF3rjxUlZ4ksQw+kdJ3IU+As5NLCdWd2xW
jCELPCXE2zlps6RjowJSykKqmtD0oGzdP3vVCiU4IbmVYF4LLpw=
=eXee
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-4.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFS-over-RDMA client fixes for Linux 4.16 #2
Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
When an erspan tunnel device receives an erpsan packet with different
tunnel metadata (ex: version, index, hwid, direction), existing code
overwrites the tunnel device's erspan configuration with the received
packet's metadata. The patch fixes it.
Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support")
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d350a823020e ("net: erspan: create erspan metadata uapi header")
moves the erspan 'version' in front of the 'struct erspan_md2' for
later extensibility reason. This breaks the existing erspan metadata
extraction code because the erspan_md2 then has a 4-byte offset
to between the erspan_metadata and erspan_base_hdr. This patch
fixes it.
Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode")
Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Li Shuang reported an Oops with cls_u32 due to an use-after-free
in u32_destroy_key(). The use-after-free can be triggered with:
dev=lo
tc qdisc add dev $dev root handle 1: htb default 10
tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256
tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\
10.0.0.0/8 hashkey mask 0x0000ff00 at 16 link 1:
tc qdisc del dev $dev root
Which causes the following kasan splat:
==================================================================
BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
Read of size 4 at addr ffff881b83dae618 by task kworker/u48:5/571
CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87
Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32]
Call Trace:
dump_stack+0xd6/0x182
? dma_virt_map_sg+0x22e/0x22e
print_address_description+0x73/0x290
kasan_report+0x277/0x360
? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
u32_destroy_key.constprop.21+0x117/0x140 [cls_u32]
u32_delete_key_freepf_work+0x1c/0x30 [cls_u32]
process_one_work+0xae0/0x1c80
? sched_clock+0x5/0x10
? pwq_dec_nr_in_flight+0x3c0/0x3c0
? _raw_spin_unlock_irq+0x29/0x40
? trace_hardirqs_on_caller+0x381/0x570
? _raw_spin_unlock_irq+0x29/0x40
? finish_task_switch+0x1e5/0x760
? finish_task_switch+0x208/0x760
? preempt_notifier_dec+0x20/0x20
? __schedule+0x839/0x1ee0
? check_noncircular+0x20/0x20
? firmware_map_remove+0x73/0x73
? find_held_lock+0x39/0x1c0
? worker_thread+0x434/0x1820
? lock_contended+0xee0/0xee0
? lock_release+0x1100/0x1100
? init_rescuer.part.16+0x150/0x150
? retint_kernel+0x10/0x10
worker_thread+0x216/0x1820
? process_one_work+0x1c80/0x1c80
? lock_acquire+0x1a5/0x540
? lock_downgrade+0x6b0/0x6b0
? sched_clock+0x5/0x10
? lock_release+0x1100/0x1100
? compat_start_thread+0x80/0x80
? do_raw_spin_trylock+0x190/0x190
? _raw_spin_unlock_irq+0x29/0x40
? trace_hardirqs_on_caller+0x381/0x570
? _raw_spin_unlock_irq+0x29/0x40
? finish_task_switch+0x1e5/0x760
? finish_task_switch+0x208/0x760
? preempt_notifier_dec+0x20/0x20
? __schedule+0x839/0x1ee0
? kmem_cache_alloc_trace+0x143/0x320
? firmware_map_remove+0x73/0x73
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x39/0x1c0
? schedule+0xf3/0x3b0
? lock_downgrade+0x6b0/0x6b0
? __schedule+0x1ee0/0x1ee0
? do_wait_intr_irq+0x340/0x340
? do_raw_spin_trylock+0x190/0x190
? _raw_spin_unlock_irqrestore+0x32/0x60
? process_one_work+0x1c80/0x1c80
? process_one_work+0x1c80/0x1c80
kthread+0x312/0x3d0
? kthread_create_worker_on_cpu+0xc0/0xc0
ret_from_fork+0x3a/0x50
Allocated by task 1688:
kasan_kmalloc+0xa0/0xd0
__kmalloc+0x162/0x380
u32_change+0x1220/0x3c9e [cls_u32]
tc_ctl_tfilter+0x1ba6/0x2f80
rtnetlink_rcv_msg+0x4f0/0x9d0
netlink_rcv_skb+0x124/0x320
netlink_unicast+0x430/0x600
netlink_sendmsg+0x8fa/0xd60
sock_sendmsg+0xb1/0xe0
___sys_sendmsg+0x678/0x980
__sys_sendmsg+0xc4/0x210
do_syscall_64+0x232/0x7f0
return_from_SYSCALL_64+0x0/0x75
Freed by task 112:
kasan_slab_free+0x71/0xc0
kfree+0x114/0x320
rcu_process_callbacks+0xc3f/0x1600
__do_softirq+0x2bf/0xc06
The buggy address belongs to the object at ffff881b83dae600
which belongs to the cache kmalloc-4096 of size 4096
The buggy address is located 24 bytes inside of
4096-byte region [ffff881b83dae600, ffff881b83daf600)
The buggy address belongs to the page:
page:ffffea006e0f6a00 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
flags: 0x17ffffc0008100(slab|head)
raw: 0017ffffc0008100 0000000000000000 0000000000000000 0000000100070007
raw: dead000000000100 dead000000000200 ffff880187c0e600 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
The problem is that the htnode is freed before the linked knodes and the
latter will try to access the first at u32_destroy_key() time.
This change addresses the issue using the htnode refcnt to guarantee
the correct free order. While at it also add a RCU annotation,
to keep sparse happy.
v1 -> v2: use rtnl_derefence() instead of RCU read locks
v2 -> v3:
- don't check refcnt in u32_destroy_hnode()
- cleaned-up u32_destroy() implementation
- cleaned-up code comment
v3 -> v4:
- dropped unneeded comment
Reported-by: Li Shuang <shuali@redhat.com>
Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support the AFS dynamic root which is a pseudo-volume that doesn't connect
to any server resource, but rather is just a root directory that
dynamically creates mountpoint directories where the name of such a
directory is the name of the cell.
Such a mount can be created thus:
mount -t afs none /afs -o dyn
Dynamic root superblocks aren't shared except by bind mounts and
propagation. Cell root volumes can then be mounted by referring to them by
name, e.g.:
ls /afs/grand.central.org/
ls /afs/.grand.central.org/
The kernel will upcall to consult the DNS if the address wasn't supplied
directly.
Signed-off-by: David Howells <dhowells@redhat.com>
Create a UID field and enum that can be used to assign ULPs to
sockets. This saves a set of string comparisons if the ULP id
is known.
For sockmap, which is added in the next patches, a ULP is used to
hook into TCP sockets close state. In this case the ULP being added
is done at map insert time and the ULP is known and done on the kernel
side. In this case the named lookup is not needed. Because we don't
want to expose psock internals to user space socket options a user
visible flag is also added. For TLS this is set for BPF it will be
cleared.
Alos remove pr_notice, user gets an error code back and should check
that rather than rely on logs.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fix dst reference count leak in sctp_v4_get_dst() introduced in commit
410f03831 ("sctp: add routing output fallback"):
When walking the address_list, successive ip_route_output_key() calls
may return the same rt->dst with the reference incremented on each call.
The code would not decrement the dst refcount when the dst pointer was
identical from the previous iteration, causing the dst refcnt leak.
Testcase:
ip netns add TEST
ip netns exec TEST ip link set lo up
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link set dev dummy0 netns TEST
ip link set dev dummy1 netns TEST
ip link set dev dummy2 netns TEST
ip netns exec TEST ip addr add 192.168.1.1/24 dev dummy0
ip netns exec TEST ip link set dummy0 up
ip netns exec TEST ip addr add 192.168.1.2/24 dev dummy1
ip netns exec TEST ip link set dummy1 up
ip netns exec TEST ip addr add 192.168.1.3/24 dev dummy2
ip netns exec TEST ip link set dummy2 up
ip netns exec TEST sctp_test -H 192.168.1.2 -P 20002 -h 192.168.1.1 -p 20000 -s -B 192.168.1.3
ip netns del TEST
In 4.4 and 4.9 kernels this results to:
[ 354.179591] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 364.419674] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 374.663664] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 384.903717] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 395.143724] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 405.383645] unregister_netdevice: waiting for lo to become free. Usage count = 1
...
Fixes: 410f03831 ("sctp: add routing output fallback")
Fixes: 0ca50d12f ("sctp: fix src address selection if using secondary addresses")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When going through the bind address list in sctp_v6_get_dst() and
the previously found address is better ('matchlen > bmatchlen'),
the code continues to the next iteration without releasing currently
held destination.
Fix it by releasing 'bdst' before continue to the next iteration, and
instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
move the already existed one right after ip6_dst_lookup_flow(), i.e. we
shouldn't proceed further if we get an error for the route lookup.
Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary addresses for ipv6")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure that we release the TCP socket once it is in the TCP_CLOSE or
TCP_TIME_WAIT state (and only then) so that we don't confuse rkhunter
and its ilk.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Setting values in struct sock directly is the usual method. Remove
the long dead code using set_fs() and the related comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Alexei Starovoitov says:
====================
pull-request: bpf 2018-02-02
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) support XDP attach in libbpf, from Eric.
2) minor fixes, from Daniel, Jakub, Yonghong, Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull spectre/meltdown updates from Thomas Gleixner:
"The next round of updates related to melted spectrum:
- The initial set of spectre V1 mitigations:
- Array index speculation blocker and its usage for syscall,
fdtable and the n180211 driver.
- Speculation barrier and its usage in user access functions
- Make indirect calls in KVM speculation safe
- Blacklisting of known to be broken microcodes so IPBP/IBSR are not
touched.
- The initial IBPB support and its usage in context switch
- The exposure of the new speculation MSRs to KVM guests.
- A fix for a regression in x86/32 related to the cpu entry area
- Proper whitelisting for known to be safe CPUs from the mitigations.
- objtool fixes to deal proper with retpolines and alternatives
- Exclude __init functions from retpolines which speeds up the boot
process.
- Removal of the syscall64 fast path and related cleanups and
simplifications
- Removal of the unpatched paravirt mode which is yet another source
of indirect unproteced calls.
- A new and undisputed version of the module mismatch warning
- A couple of cleanup and correctness fixes all over the place
Yet another step towards full mitigation. There are a few things still
missing like the RBS underflow mitigation for Skylake and other small
details, but that's being worked on.
That said, I'm taking a belated christmas vacation for a week and hope
that everything is magically solved when I'm back on Feb 12th"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
KVM/x86: Add IBPB support
KVM/x86: Update the reverse_cpuid list to include CPUID_7_EDX
x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL
x86/pti: Mark constant arrays as __initconst
x86/spectre: Simplify spectre_v2 command line parsing
x86/retpoline: Avoid retpolines for built-in __init functions
x86/kvm: Update spectre-v1 mitigation
KVM: VMX: make MSR bitmaps per-VCPU
x86/paravirt: Remove 'noreplace-paravirt' cmdline option
x86/speculation: Use Indirect Branch Prediction Barrier in context switch
x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"
x86/spectre: Report get_user mitigation for spectre_v1
nl80211: Sanitize array index in parse_txq_params
vfs, fdtable: Prevent bounds-check bypass via speculative execution
x86/syscall: Sanitize syscall table de-references under speculation
x86/get_user: Use pointer masking to limit speculation
...
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass all
hardened usercopy checks since these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over the
next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
JgOmUnQNJWCTwUUw5AS1
=tzmJ
-----END PGP SIGNATURE-----
Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy whitelisting from Kees Cook:
"Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs.
To further restrict what memory is available for copying, this creates
a way to whitelist specific areas of a given slab cache object for
copying to/from userspace, allowing much finer granularity of access
control.
Slab caches that are never exposed to userspace can declare no
whitelist for their objects, thereby keeping them unavailable to
userspace via dynamic copy operations. (Note, an implicit form of
whitelisting is the use of constant sizes in usercopy operations and
get_user()/put_user(); these bypass all hardened usercopy checks since
these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over
the next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage"
* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
lkdtm: Update usercopy tests for whitelisting
usercopy: Restrict non-usercopy caches to size 0
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
kvm: whitelist struct kvm_vcpu_arch
arm: Implement thread_struct whitelist for hardened usercopy
arm64: Implement thread_struct whitelist for hardened usercopy
x86: Implement thread_struct whitelist for hardened usercopy
fork: Provide usercopy whitelisting for task_struct
fork: Define usercopy region in thread_stack slab caches
fork: Define usercopy region in mm_struct slab caches
net: Restrict unwhitelisted proto caches to size 0
sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
sctp: Define usercopy region in SCTP proto slab cache
caif: Define usercopy region in caif proto slab cache
ip: Define usercopy region in IP proto slab cache
net: Define usercopy region in struct proto slab cache
scsi: Define usercopy region in scsi_sense_cache slab cache
cifs: Define usercopy region in cifs_request slab cache
vxfs: Define usercopy region in vxfs_inode slab cache
ufs: Define usercopy region in ufs_inode_cache slab cache
...
Pull networking fixes from David Miller:
1) The bnx2x can hang if you give it a GSO packet with a segment size
which is too big for the hardware, detect and drop in this case.
From Daniel Axtens.
2) Fix some overflows and pointer leaks in xtables, from Dmitry Vyukov.
3) Missing RCU locking in igmp, from Eric Dumazet.
4) Fix RX checksum handling on r8152, it can only checksum UDP and TCP
packets. From Hayes Wang.
5) Minor pacing tweak to TCP BBR congestion control, from Neal
Cardwell.
6) Missing RCU annotations in cls_u32, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
Revert "defer call to mem_cgroup_sk_alloc()"
soreuseport: fix mem leak in reuseport_add_sock()
net: qlge: use memmove instead of skb_copy_to_linear_data
net: qed: use correct strncpy() size
net: cxgb4: avoid memcpy beyond end of source buffer
cls_u32: add missing RCU annotation.
r8152: set rx mode early when linking on
r8152: fix wrong checksum status for received IPv4 packets
nfp: fix TLV offset calculation
net: pxa168_eth: add netconsole support
net: igmp: add a missing rcu locking section
ibmvnic: fix firmware version when no firmware level has been provided by the VIOS server
vmxnet3: remove redundant initialization of pointer 'rq'
lan78xx: remove redundant initialization of pointer 'phydev'
net: jme: remove unused initialization of 'rxdesc'
rtnetlink: remove check for IFLA_IF_NETNSID
rocker: fix possible null pointer dereference in rocker_router_fib_event_work
inet: Avoid unitialized variable warning in inet_unhash()
net: bridge: Fix uninitialized error in br_fdb_sync_static()
openvswitch: Remove padding from packet before L3+ conntrack processing
...
This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol:
defer call to mem_cgroup_sk_alloc()").
Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
memcg socket memory accounting, as packets received before memcg
pointer initialization are not accounted and are causing refcounting
underflow on socket release.
Actually the free-after-use problem was fixed by
commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in
sk_clone_lock()") for the cgroup pointer.
So, let's revert it and call mem_cgroup_sk_alloc() just before
cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
we're cloning, and it holds a reference to the memcg.
Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
mem_cgroup_sk_alloc(). I see no reasons why bumping the root
memcg counter is a good reason to panic, and there are no realistic
ways to hit it.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
reuseport_add_sock() needs to deal with attaching a socket having
its own sk_reuseport_cb, after a prior
setsockopt(SO_ATTACH_REUSEPORT_?BPF)
Without this fix, not only a WARN_ONCE() was issued, but we were also
leaking memory.
Thanks to sysbot and Eric Biggers for providing us nice C repros.
------------[ cut here ]------------
socket already in reuseport group
WARNING: CPU: 0 PID: 3496 at net/core/sock_reuseport.c:119
reuseport_add_sock+0x742/0x9b0 net/core/sock_reuseport.c:117
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 3496 Comm: syzkaller869503 Not tainted 4.15.0-rc6+ #245
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
panic+0x1e4/0x41c kernel/panic.c:183
__warn+0x1dc/0x200 kernel/panic.c:547
report_bug+0x211/0x2d0 lib/bug.c:184
fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
fixup_bug arch/x86/kernel/traps.c:247 [inline]
do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
Fixes: ef456144da8e ("soreuseport: define reuseport groups")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+c0ea2226f77a42936bf7@syzkaller.appspotmail.com
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a couple of points of the control path, n->ht_down is currently
accessed without the required RCU annotation. The accesses are
safe, but sparse complaints. Since we already held the
rtnl lock, let use rtnl_dereference().
Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs")
Fixes: de5df63228fc ("net: sched: cls_u32 changes to knode must appear atomic to readers")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kalderon reports a BUG that occurs just after device removal:
[ 169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
[ 169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]
The RPC/RDMA client transport attempts to allocate some resources
on demand. Registered buffers are one such resource. These are
allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
and Reply messages. A hardware resource is associated with each of
these buffers, as they can be used for a Send or Receive Work
Request.
If a device is removed from under an NFS/RDMA mount, the transport
layer is responsible for releasing all hardware resources before
the device can be finally unplugged. A BUG results when the NFS
mount hasn't yet seen much activity: the transport tries to release
resources that haven't yet been allocated.
rpcrdma_free_regbuf() already checks for this case, so just move
that check to cover the DEVICE_REMOVAL case as well.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 16f906d66cd7 ("xprtrdma: Reduce required number of send
SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
a problem where xprtrdma would not work if the device's max_sge
capability was small (low single digits).
At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
each RPC. ri_max_send_sges is set to this value:
ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
Then when marshaling each RPC, rpcrdma_args_inline uses that value
to determine whether the device has enough Send SGEs to convey an
NFS WRITE payload inline, or whether instead a Read chunk is
required.
More recently, commit ae72950abf99 ("xprtrdma: Add data structure to
manage RDMA Send arguments") used the ri_max_send_sges value to
calculate the size of an array, but that commit erroneously assumed
ri_max_send_sges contains a value similar to the device's max_sge,
and not one that was reduced by the minimum SGE count.
This assumption results in the calculated size of the sendctx's
Send SGE array to be too small. When the array is used to marshal
an RPC, the code can write Send SGEs into the following sendctx
element in that array, corrupting it. When the device's max_sge is
large, this issue is entirely harmless; but it results in an oops
in the provider's post_send method, if dev.attrs.max_sge is small.
So let's straighten this out: ri_max_send_sges will now contain a
value with the same meaning as dev.attrs.max_sge, which makes
the code easier to understand, and enables rpcrdma_sendctx_create
to calculate the size of the SGE array correctly.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
nft_flow_offload module removal does not require to flush existing
flowtables, it is valid to remove this module while keeping flowtables
around.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If netdevice goes down, then flowtable entries are scheduled to be
removed. Wait for garbage collector to have a chance to run so it can
delete them from the hashtable.
The flush call might sleep, so hold the nfnl mutex from
nft_flow_table_iterate() instead of rcu read side lock. The use of the
nfnl mutex is also implicitly fixing races between updates via nfnetlink
and netdevice event.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xt_cgroup_info_v1->priv is an internal pointer only used for kernel,
we should not trust what user-space provides.
Reported-by: <syzbot+4fbcfcc0d2e6592bd641@syzkaller.appspotmail.com>
Fixes: c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
config NF_FLOW_TABLE depends on NETFILTER_INGRESS. If users forget to
enable this toggle, flowtable registration fails with EOPNOTSUPP.
Moreover, turn 'select NF_FLOW_TABLE' in every flowtable family flavour
into dependency instead, otherwise this new dependency on
NETFILTER_INGRESS causes a warning. This also allows us to remove the
explicit dependency between family flowtables <-> NF_TABLES and
NF_CONNTRACK, given they depend on the NF_FLOW_TABLE core that already
expresses the general dependencies for this new infrastructure.
Moreover, NF_FLOW_TABLE_INET depends on NF_FLOW_TABLE_IPV4 and
NF_FLOWTABLE_IPV6, which already depends on NF_FLOW_TABLE. So we can get
rid of direct dependency with NF_FLOW_TABLE.
In general, let's avoid 'select', it just makes things more complicated.
Reported-by: John Crispin <john@phrozen.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Failures were seen in ICMPv6 fragmentation timeout tests if they were
run after the RFC2460 failure tests. Kernel was not sending out the
ICMPv6 fragment reassembly time exceeded packet after the fragmentation
reassembly timeout of 1 minute had elapsed.
This happened because the frag queue was not released if an error in
IPv6 fragmentation header was detected by RFC2460.
Fixes: 83f1999caeb1 ("netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>