[ Upstream commit 4b29dba9c0 ]
The VXLAN neigh_reduce() code is completely non-functional since
check-in. Specific errors:
1) The original code drops all packets with a multicast destination address,
even though neighbor solicitations are sent to the solicited-node
address, a multicast address. The code after this check was never run.
2) The neighbor table lookup used the IPv6 header destination, which is the
solicited node address, rather than the target address from the
neighbor solicitation. So neighbor lookups would always fail if it
got this far. Also for L3MISSes.
3) The code calls ndisc_send_na(), which does a send on the tunnel device.
The context for neigh_reduce() is the transmit path, vxlan_xmit(),
where the host or a bridge-attached neighbor is trying to transmit
a neighbor solicitation. To respond to it, the tunnel endpoint needs
to do a *receive* of the appropriate neighbor advertisement. Doing a
send, would only try to send the advertisement, encapsulated, to the
remote destinations in the fdb -- hosts that definitely did not do the
corresponding solicitation.
4) The code uses the tunnel endpoint IPv6 forwarding flag to determine the
isrouter flag in the advertisement. This has nothing to do with whether
or not the target is a router, and generally won't be set since the
tunnel endpoint is bridging, not routing, traffic.
The patch below creates a proxy neighbor advertisement to respond to
neighbor solicitions as intended, providing proper IPv6 support for neighbor
reduction.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7346135dcd ]
This patch fixes a NULL pointer dereference in the event of an
skb allocation failure in arp_reduce().
Signed-Off-By: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ff0992e903 ]
This is a context modified revert of commit 6a9612e2cb
("net: cdc_ncm: remove ncm_parm field") which introduced
a NCM specification violation, causing setup errors for
some devices. These errors resulted in the device and
host disagreeing about shared settings, with complete
failure to communicate as the end result.
The NCM specification require that many of the NCM specific
control reuests are sent only while the NCM Data Interface
is in alternate setting 0. Reverting the commit ensures that
we follow this requirement.
Fixes: 6a9612e2cb ("net: cdc_ncm: remove ncm_parm field")
Reported-and-tested-by: Pasi Kärkkäinen <pasik@iki.fi>
Reported-by: Thomas Schäfer <tschaefer@t-online.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e367c2d03d ]
In ip6_append_data_mtu(), when the xfrm mode is not tunnel(such as
transport),the ipsec header need to be added in the first fragment, so the mtu
will decrease to reserve space for it, then the second fragment come, the mtu
should be turn back, as the commit 0c1833797a
said. however, in the commit a493e60ac4bbe2e977e7129d6d8cbb0dd236be, it use
*mtu = min(*mtu, ...) to change the mtu, which lead to the new mtu is alway
equal with the first fragment's. and cannot turn back.
when I test through ping6 -c1 -s5000 $ip (mtu=1280):
...frag (0|1232) ESP(spi=0x00002000,seq=0xb), length 1232
...frag (1232|1216)
...frag (2448|1216)
...frag (3664|1216)
...frag (4880|164)
which should be:
...frag (0|1232) ESP(spi=0x00001000,seq=0x1), length 1232
...frag (1232|1232)
...frag (2464|1232)
...frag (3696|1232)
...frag (4928|116)
so delete the min() when change back the mtu.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Fixes: 75a493e60a ("ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ecab67015e ]
tmp_prefered_lft is an offset to ifp->tstamp, not now. Therefore
age needs to be added to the condition.
Age calculation in ipv6_create_tempaddr is different from the one
in addrconf_verify and doesn't consider ADDRCONF_TIMER_FUZZ_MINUS.
This can cause age in ipv6_create_tempaddr to be less than the one
in addrconf_verify and therefore unnecessary temporary address to
be generated.
Use age calculation as in addrconf_modify to avoid this.
Signed-off-by: Heiner Kallweit <heiner.kallweit@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 84fe61821e ]
If the Freescale fec is in promiscuous mode and network cable is
reconnected then the promiscuous mode get lost. The problem is caused
by a too soon call of set_multicast_list to re-enable promisc mode.
The FEC_R_CNTRL register changes are overwritten by fec_restart.
This patch fixes this by moving the call behind the init of FEC_R_CNTRL
register in fec_restart.
Successful tested on a i.MX28 board.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fb00bc2e6c ]
The commit d3ab3ffd1d
(bonding: use rlb_client_info->vlan_id instead of ->tag)
remove the rlb_client_info->tag, but occur some issues,
The vlan_get_tag() will return 0 for success and -EINVAL for
error, so the client_info->vlan_id always be set to 0 if the
vlan_get_tag return 0 for success, so the client_info would
never get a correct vlan id.
We should only set the vlan id to 0 when the vlan_get_tag return error.
Fixes: d3ab3ffd1d (bonding: use rlb_client_info->vlan_id instead of ->tag)
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dbb490b965 ]
When copying in a struct msghdr from the user, if the user has set the
msg_namelen parameter to a negative value it gets clamped to a valid
size due to a comparison between signed and unsigned values.
Ensure the syscall errors when the user passes in a negative value.
Signed-off-by: Matthew Leach <matthew.leach@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 20a599bec9 ]
Without this check someone could easily create a denial of service
by injecting multicast-specific queries to enable the bridge
snooping part if no real querier issuing periodic general queries
is present on the link which would result in the bridge wrongly
shutting down ports for multicast traffic as the bridge did not learn
about these listeners.
With this patch the snooping code is enabled upon receiving valid,
general queries only.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9ed973cc40 ]
General IGMP and MLD queries are supposed to have the multicast
link-local all-nodes address as their destination according to RFC2236
section 9, RFC3376 section 4.1.12/9.1, RFC2710 section 8 and RFC3810
section 5.1.15.
Without this check, such malformed IGMP/MLD queries can result in a
denial of service: The queries are ignored by most IGMP/MLD listeners
therefore they will not respond with an IGMP/MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1fd819ecb9 ]
skb_segment copies frags around, so we need
to copy them carefully to avoid accessing
user memory after reporting completion to userspace
through a callback.
skb_segment doesn't normally happen on datapath:
TSO needs to be disabled - so disabling zero copy
in this case does not look like a big deal.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1a4cedaf65 ]
fskb is unrelated to frag: it's coming from
frag_list. Rename it list_skb to avoid confusion.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit df5771ffef ]
rename local variable to make it easier to tell at a glance that we are
dealing with a head skb.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4e1beba12d ]
skb_frag can in fact point at either skb
or fskb so rename it generally "frag".
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8cb19905e9 ]
frag points at nskb, so name it appropriately
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dd38743b4c ]
With TX VLAN offload enabled the source MAC address for frames sent using the
VLAN interface is currently set to the address of the real interface. This is
wrong since the VLAN interface may be configured with a different address.
The bug was introduced in commit 2205369a31
("vlan: Fix header ops passthru when doing TX VLAN offload.").
This patch sets the source address before calling the create function of the
real interface.
Signed-off-by: Peter Boström <peter.bostrom@netrounds.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5bd0767086 ]
Current netback uses gso_type to check whether the skb contains
gso offload, and this is wrong. Gso_size is the right one to
check gso existence, and gso_type is only used to check gso type.
Some skbs contains nonzero gso_type and zero gso_size, current
netback would treat these skbs as gso and create wrong response
for this. This also causes ssh failure to domu from other server.
V2: use skb_is_gso function as Paul Durrant suggested
Signed-off-by: Annie Li <annie.li@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2d8d40afd1 ]
Resizing fq hash table allocates memory while holding qdisc spinlock,
with BH disabled.
This is definitely not good, as allocation might sleep.
We can drop the lock and get it when needed, we hold RTNL so no other
changes can happen at the same time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a8d9bc2e9f ]
The pci shutdown handler added in:
bnx2: Add pci shutdown handler
commit 25bfb1dd4b
created a shutdown down sequence without chip reset if the device was
never brought up. This can cause the firmware to shutdown the PHY
prematurely and cause MMIO read cycles to be unresponsive. On some
systems, it may generate NMI in the bnx2's pci shutdown handler.
The fix is to tell the firmware not to shutdown the PHY if there was
no prior chip reset.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c88507fbad ]
DST_NOCOUNT should only be used if an authorized user adds routes
locally. In case of routes which are added on behalf of router
advertisments this flag must not get used as it allows an unlimited
number of routes getting added remotely.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d2d273ffab ]
Without this fix, ipv6_exthdrs_offload_init doesn't register IPPROTO_DSTOPTS
offload, but returns 0 (as the IPPROTO_ROUTING registration actually succeeds).
This then causes the ipv6_gso_segment to drop IPv6 packets with IPPROTO_DSTOPTS
header.
The issue detected and the fix verified by running MS HCK Offload LSO test on
top of QEMU Windows guests, as this test sends IPv6 packets with
IPPROTO_DSTOPTS.
Signed-off-by: Anton Nayshtut <anton@swortex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit de14439167 ]
Some applications didn't expect recvmsg() on a non blocking socket
could return -EINTR. This possibility was added as a side effect
of commit b3ca9b02b0 ("net: fix multithreaded signal handling in
unix recv routines").
To hit this bug, you need to be a bit unlucky, as the u->readlock
mutex is usually held for very small periods.
Fixes: b3ca9b02b0 ("net: fix multithreaded signal handling in unix recv routines")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e588e2f286 ]
Quoting Alexander Aring:
While fragmentation and unloading of 6lowpan module I got this kernel Oops
after few seconds:
BUG: unable to handle kernel paging request at f88bbc30
[..]
Modules linked in: ipv6 [last unloaded: 6lowpan]
Call Trace:
[<c012af4c>] ? call_timer_fn+0x54/0xb3
[<c012aef8>] ? process_timeout+0xa/0xa
[<c012b66b>] run_timer_softirq+0x140/0x15f
Problem is that incomplete frags are still around after unload; when
their frag expire timer fires, we get crash.
When a netns is removed (also done when unloading module), inet_frag
calls the evictor with 'force' argument to purge remaining frags.
The evictor loop terminates when accounted memory ('work') drops to 0
or the lru-list becomes empty. However, the mem accounting is done
via percpu counters and may not be accurate, i.e. loop may terminate
prematurely.
Alter evictor to only stop once the lru list is empty when force is
requested.
Reported-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Reported-by: Alexander Aring <alex.aring@gmail.com>
Tested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2892505ea1 ]
Failure to schedule a TIPC tasklet with tipc_k_signal because the
tasklet handler is disabled is not an error. It means TIPC is
currently in the process of shutting down. We remove the error
logging in this case.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1bb8dce57f ]
When the TIPC module is removed, the tasklet handler is disabled
before all other subsystems. This will cause lingering publications
in the name table because the node_down tasklets responsible to
clean up publications from an unreachable node will never run.
When the name table is shut down, these publications are detected
and an error message is logged:
tipc: nametbl_stop(): orphaned hash chain detected
This is actually a memory leak, introduced with commit
993b858e37 ("tipc: correct the order
of stopping services at rmmod")
Instead of just logging an error and leaking memory, we free
the orphaned entries during nametable shutdown.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit edcc0511b5 ]
When a topology server subscriber is disconnected, the associated
connection id is set to zero. A check vs zero is then done in the
subscription timeout function to see if the subscriber have been
shut down. This is unnecessary, because all subscription timers
will be cancelled when a subscriber terminates. Setting the
connection id to zero is actually harmful because id zero is the
identity of the topology server listening socket, and can cause a
race that leads to this socket being closed instead.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4652edb70e ]
When tipc_conn_sendmsg() calls tipc_conn_lookup() to query a
connection instance, its reference count value is increased if
it's found. But subsequently if it's found that the connection is
closed, the work of sending message is not queued into its server
send workqueue, and the connection reference count is not decreased.
This will cause a reference count leak. To reproduce this problem,
an application would need to open and closes topology server
connections with high intensity.
We fix this by immediately decrementing the connection reference
count if a send fails due to the connection being closed.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6d4ebeb4df ]
Currently connection shutdown callback function is called when
connection instance is released in tipc_conn_kref_release(), and
receiving packets and sending packets are running in different
threads. Even if connection is closed by the thread of receiving
packets, its shutdown callback may not be called immediately as
the connection reference count is non-zero at that moment. So,
although the connection is shut down by the thread of receiving
packets, the thread of sending packets doesn't know it. Before
its shutdown callback is invoked to tell the sending thread its
connection has been closed, the sending thread may deliver
messages by tipc_conn_sendmsg(), this is why the following error
information appears:
"Sending subscription event failed, no memory"
To eliminate it, allow connection shutdown callback function to
be called before connection id is removed in tipc_close_conn(),
which makes the sending thread know the truth in time that its
socket is closed so that it doesn't send message to it. We also
remove the "Sending XXX failed..." error reporting for topology
and config services.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6565b9eeef ]
MLD queries are supposed to have an IPv6 link-local source address
according to RFC2710, section 4 and RFC3810, section 5.1.14. This patch
adds a sanity check to ignore such broken MLD queries.
Without this check, such malformed MLD queries can result in a
denial of service: The queries are ignored by any MLD listener
therefore they will not respond with an MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c485658bae ]
While working on ec0223ec48 ("net: sctp: fix sctp_sf_do_5_1D_ce to
verify if we/peer is AUTH capable"), we noticed that there's a skb
memory leakage in the error path.
Running the same reproducer as in ec0223ec48 and by unconditionally
jumping to the error label (to simulate an error condition) in
sctp_sf_do_5_1D_ce() receive path lets kmemleak detector bark about
the unfreed chunk->auth_chunk skb clone:
Unreferenced object 0xffff8800b8f3a000 (size 256):
comm "softirq", pid 0, jiffies 4294769856 (age 110.757s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
89 ab 75 5e d4 01 58 13 00 00 00 00 00 00 00 00 ..u^..X.........
backtrace:
[<ffffffff816660be>] kmemleak_alloc+0x4e/0xb0
[<ffffffff8119f328>] kmem_cache_alloc+0xc8/0x210
[<ffffffff81566929>] skb_clone+0x49/0xb0
[<ffffffffa0467459>] sctp_endpoint_bh_rcv+0x1d9/0x230 [sctp]
[<ffffffffa046fdbc>] sctp_inq_push+0x4c/0x70 [sctp]
[<ffffffffa047e8de>] sctp_rcv+0x82e/0x9a0 [sctp]
[<ffffffff815abd38>] ip_local_deliver_finish+0xa8/0x210
[<ffffffff815a64af>] nf_reinject+0xbf/0x180
[<ffffffffa04b4762>] nfqnl_recv_verdict+0x1d2/0x2b0 [nfnetlink_queue]
[<ffffffffa04aa40b>] nfnetlink_rcv_msg+0x14b/0x250 [nfnetlink]
[<ffffffff815a3269>] netlink_rcv_skb+0xa9/0xc0
[<ffffffffa04aa7cf>] nfnetlink_rcv+0x23f/0x408 [nfnetlink]
[<ffffffff815a2bd8>] netlink_unicast+0x168/0x250
[<ffffffff815a2fa1>] netlink_sendmsg+0x2e1/0x3f0
[<ffffffff8155cc6b>] sock_sendmsg+0x8b/0xc0
[<ffffffff8155d449>] ___sys_sendmsg+0x369/0x380
What happens is that commit bbd0d59809 clones the skb containing
the AUTH chunk in sctp_endpoint_bh_rcv() when having the edge case
that an endpoint requires COOKIE-ECHO chunks to be authenticated:
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
------------------ AUTH; COOKIE-ECHO ---------------->
<-------------------- COOKIE-ACK ---------------------
When we enter sctp_sf_do_5_1D_ce() and before we actually get to
the point where we process (and subsequently free) a non-NULL
chunk->auth_chunk, we could hit the "goto nomem_init" path from
an error condition and thus leave the cloned skb around w/o
freeing it.
The fix is to centrally free such clones in sctp_chunk_destroy()
handler that is invoked from sctp_chunk_free() after all refs have
dropped; and also move both kfree_skb(chunk->auth_chunk) there,
so that chunk->auth_chunk is either NULL (since sctp_chunkify()
allocs new chunks through kmem_cache_zalloc()) or non-NULL with
a valid skb pointer. chunk->skb and chunk->auth_chunk are the
only skbs in the sctp_chunk structure that need to be handeled.
While at it, we should use consume_skb() for both. It is the same
as dev_kfree_skb() but more appropriately named as we are not
a device but a protocol. Also, this effectively replaces the
kfree_skb() from both invocations into consume_skb(). Functions
are the same only that kfree_skb() assumes that the frame was
being dropped after a failure (e.g. for tools like drop monitor),
usage of consume_skb() seems more appropriate in function
sctp_chunk_destroy() though.
Fixes: bbd0d59809 ("[SCTP]: Implement the receive and verification of AUTH chunk")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vlad Yasevich <yasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 24b9bf43e9 ]
I stumbled upon this very serious bug while hunting for another one,
it's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).
What happens is that after a fragment has been added to the hash chain
but before it's been added to the lru_list (inet_frag_lru_add) in
inet_frag_intern, it may get deleted (either by an expired timer if
the system load is high or the timer sufficiently low, or by the
fraq_queue function for different reasons) before it's added to the
lru_list, then after it gets added it's a matter of time for the
evictor to get to a piece of memory which has been freed leading to a
number of different bugs depending on what's left there.
I've been able to trigger this on both IPv4 and IPv6 (which is normal
as the frag code is the same), but it's been much more difficult to
trigger on IPv4 due to the protocol differences about how fragments
are treated.
The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
in a RR bond, so the same flow can be seen on multiple cards at the
same time. Then I used multiple instances of ping/ping6 to generate
fragmented packets and flood the machines with them while running
other processes to load the attacked machine.
*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes, if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.
An important point to make is that any caller (frag_queue or timer) of
inet_frag_kill will remove both the timer refcount and the
original/guarding refcount thus removing everything that's keeping the
frag from being freed at the next inet_frag_put. All of this could
happen before the frag was ever added to the LRU list, then it gets
added and the evictor uses a freed fragment.
An example for IPv6 would be if a fragment is being added and is at
the stage of being inserted in the hash after the hash lock is
released, but before inet_frag_lru_add executes (or is able to obtain
the lru lock) another overlapping fragment for the same flow arrives
at a different CPU which finds it in the hash, but since it's
overlapping it drops it invoking inet_frag_kill and thus removing all
guarding refcounts, and afterwards freeing it by invoking
inet_frag_put which removes the last refcount added previously by
inet_frag_find, then inet_frag_lru_add gets executed by
inet_frag_intern and we have a freed fragment in the lru_list.
The fix is simple, just move the lru_add under the hash chain locked
region so when a removing function is called it'll have to wait for
the fragment to be added to the lru_list, and then it'll remove it (it
works because the hash chain removal is done before the lru_list one
and there's no window between the two list adds when the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.
Fixes: 3ef0eb0db4 ("net: frag, move LRU list maintenance outside of
rwlock")
CC: Florian Westphal <fw@strlen.de>
CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f64410ec66 upstream.
This patch is based on an earlier patch by Eric Paris, he describes
the problem below:
"If an inode is accessed before policy load it will get placed on a
list of inodes to be initialized after policy load. After policy
load we call inode_doinit() which calls inode_doinit_with_dentry()
on all inodes accessed before policy load. In the case of inodes
in procfs that means we'll end up at the bottom where it does:
/* Default to the fs superblock SID. */
isec->sid = sbsec->sid;
if ((sbsec->flags & SE_SBPROC) && !S_ISLNK(inode->i_mode)) {
if (opt_dentry) {
isec->sclass = inode_mode_to_security_class(...)
rc = selinux_proc_get_sid(opt_dentry,
isec->sclass,
&sid);
if (rc)
goto out_unlock;
isec->sid = sid;
}
}
Since opt_dentry is null, we'll never call selinux_proc_get_sid()
and will leave the inode labeled with the label on the superblock.
I believe a fix would be to mimic the behavior of xattrs. Look
for an alias of the inode. If it can't be found, just leave the
inode uninitialized (and pick it up later) if it can be found, we
should be able to call selinux_proc_get_sid() ..."
On a system exhibiting this problem, you will notice a lot of files in
/proc with the generic "proc_t" type (at least the ones that were
accessed early in the boot), for example:
# ls -Z /proc/sys/kernel/shmmax | awk '{ print $4 " " $5 }'
system_u:object_r:proc_t:s0 /proc/sys/kernel/shmmax
However, with this patch in place we see the expected result:
# ls -Z /proc/sys/kernel/shmmax | awk '{ print $4 " " $5 }'
system_u:object_r:sysctl_kernel_t:s0 /proc/sys/kernel/shmmax
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b22f5126a2 upstream.
Some occurences in the netfilter tree use skb_header_pointer() in
the following way ...
struct dccp_hdr _dh, *dh;
...
skb_header_pointer(skb, dataoff, sizeof(_dh), &dh);
... where dh itself is a pointer that is being passed as the copy
buffer. Instead, we need to use &_dh as the forth argument so that
we're copying the data into an actual buffer that sits on the stack.
Currently, we probably could overwrite memory on the stack (e.g.
with a possibly mal-formed DCCP packet), but unintentionally, as
we only want the buffer to be placed into _dh variable.
Fixes: 2bc780499a ("[NETFILTER]: nf_conntrack: add DCCP protocol support")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0ab02ca8f8 upstream.
Setup cgroupfs like this:
# mount -t cgroup -o cpuacct xxx /cgroup
# mkdir /cgroup/sub1
# mkdir /cgroup/sub2
Then run these two commands:
# for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } &
# for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } &
After seconds you may see this warning:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
idr_remove called for id=6 which is not allocated.
...
Call Trace:
[<ffffffff8156063c>] dump_stack+0x7a/0x96
[<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0
[<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50
[<ffffffff81300aa7>] sub_remove+0x87/0x1b0
[<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0
[<ffffffff81300bf5>] idr_remove+0x25/0xd0
[<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0
[<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0
[<ffffffff8107cdbc>] process_one_work+0x26c/0x550
[<ffffffff8107eefe>] worker_thread+0x12e/0x3b0
[<ffffffff81085f96>] kthread+0xe6/0xf0
[<ffffffff81570bac>] ret_from_fork+0x7c/0xb0
---[ end trace 2d1577ec10cf80d0 ]---
It's because allocating/removing cgroup ID is not properly synchronized.
The bug was introduced when we converted cgroup_ida to cgroup_idr.
While synchronization is already done inside ida_simple_{get,remove}(),
users are responsible for concurrent calls to idr_{alloc,remove}().
tj: Refreshed on top of b58c89986a ("cgroup: fix error return from
cgroup_create()").
[mhocko@suse.cz: ported to 3.12]
Fixes: 4e96ee8e98 ("cgroup: convert cgroup_ida to cgroup_idr")
Cc: <stable@vger.kernel.org> #3.12+
Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 668f9abbd4 upstream.
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).
This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.
This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.
This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.
Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 38129a13e6 upstream.
fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating. Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.
[fix for dumb braino folded]
Spotted by: Max Kellermann <mk@cm4all.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0b1b901b5a upstream.
If the dest_mnt is not shared, propagate_mnt() does nothing -
there's no mounts to propagate to and thus no copies to create.
Might as well don't bother calling it in that case.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 05efa8c943 upstream.
Commit 4af712e8df ("random32: add prandom_reseed_late() and call when
nonblocking pool becomes initialized") has added a late reseed stage
that happens as soon as the nonblocking pool is marked as initialized.
This fails in the case that the nonblocking pool gets initialized
during __prandom_reseed()'s call to get_random_bytes(). In that case
we'd double back into __prandom_reseed() in an attempt to do a late
reseed - deadlocking on 'lock' early on in the boot process.
Instead, just avoid even waiting to do a reseed if a reseed is already
occuring.
Fixes: 4af712e8df ("random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e3a8786c10 upstream.
Commit 5445eaf309 ('mvneta: Try to fix mvneta when compiled as
module') fixed the mvneta driver to make it work properly when loaded
as a module in SGMII configuration, which was tested successful by the
author on the Armada XP OpenBlocks AX3, which uses SGMII.
However, it turns out that the Armada XP GP, which uses RGMII, is
affected by a similar problem: its SERDES configuration is lost when
mvneta is loaded as a module, because this configuration is set by the
bootloader, and then lost because the clock is gated by the clock
framework until the mvneta driver is loaded again and the clock is
re-enabled.
However, it turns out that for the RGMII case, setting the SERDES
configuration is not sufficient: the PCS enable bit in the
MVNETA_GMAC_CTRL_2 register must also be set, like in the SGMII
configuration.
Therefore, this commit reworks the SGMII/RGMII initialization: the
only difference between the two now is a different SERDES
configuration, all the rest is identical.
In detail, to achieve this, the commit:
* Renames MVNETA_SGMII_SERDES_CFG to MVNETA_SERDES_CFG because it is
not specific to SGMII, but also used on RGMII configurations.
* Adds a MVNETA_RGMII_SERDES_PROTO definition, that must be used as
the MVNETA_SERDES_CFG value in RGMII configurations.
* Removes the mvneta_gmac_rgmii_set() and mvneta_port_sgmii_config()
functions, and instead directly do the SGMII/RGMII configuration in
mvneta_port_up(), from where those functions where called. It is
worth mentioning that mvneta_gmac_rgmii_set() had an 'enable'
parameter that was always passed as '1', so it was pretty useless.
* Reworks the mvneta_port_up() function to set the MVNETA_SERDES_CFG
register to the appropriate value depending on the RGMII vs. SGMII
configuration. It also unconditionally set the PCS_ENABLE bit (was
already done for SGMII, but is now also needed for RGMII), and sets
the PORT_RGMII bit (which was already done for both SGMII and
RGMII).
This commit was successfully tested with mvneta compiled as a module,
on both the OpenBlocks AX3 (SGMII configuration) and the Armada XP GP
(RGMII configuration).
Reported-by: Steve McIntyre <steve@einval.com>
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a79121d3b5 upstream.
Bit 3 of the MVNETA_GMAC_CTRL_2 is actually used to enable the PCS,
not the PSC: there was a typo in the name of the define, which this
commit fixes.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e825196d48 upstream.
In all callchains leading to prepend_name(), the value left in *buflen
is eventually discarded unused if prepend_name() has returned a negative.
So we are free to do what prepend() does, and subtract from *buflen
*before* checking for underflow (which turns into checking the sign
of subtraction result, of course).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 825600c0f2 upstream.
On x86 uniprocessor systems topology_physical_package_id() returns -1
which causes rapl_cpu_prepare() to leave rapl_pmu variable uninitialized
which leads to GPF in rapl_pmu_init().
See arch/x86/kernel/cpu/perf_event_intel_rapl.c.
It turns out that physical_package_id and core_id can actually be
retreived for uniprocessor systems too. Enabling them also fixes
rapl_pmu code.
Signed-off-by: Artem Fetishev <artem_fetishev@epam.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5f12c5eca6 upstream.
Fixes a build break due to the undeclared use of irq_of_parse_and_map()
and of_iomap(). This build break was apparently introduced while the
driver was unbuildable due to the bug fixed by
62c19c9d29 ("i2c: Remove usage of
orphaned symbol OF_I2C"). When 62c19c was added in v3.14-rc7,
the driver was enabled again, breaking the powerpc mpc85xx_defconfig
and mpc85xx_smp_defconfig.
62c19c is marked for stable, so this should go there as well.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5926f87fda upstream.
This reverts commit a9c8e4beee.
PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
is set and pseudo-physical addresses is _PAGE_PRESENT is clear.
This is because during a domain save/restore (migration) the page
table entries are "canonicalised" and uncanonicalised". i.e., MFNs are
converted to PFNs during domain save so that on a restore the page
table entries may be rewritten with the new MFNs on the destination.
This canonicalisation is only done for PTEs that are present.
This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or
_PAGE_NUMA) was set but _PAGE_PRESENT was clear. These PTEs would be
migrated as-is which would result in unexpected behaviour in the
destination domain. Either a) the MFN would be translated to the
wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE
because the MFN is no longer owned by the domain; or c) the present
bit would not get set.
Symptoms include "Bad page" reports when munmapping after migrating a
domain.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 09ed3d5ba0 upstream.
Xen balloon driver will update ballooned out pages' P2M entries to point
to scratch page for PV guests. In 24f69373e2 ("xen/balloon: don't alloc
page while non-preemptible", kmap_flush_unused was moved after updating
P2M table. In that case for 32 bit PV guest we might end up with
P2M X -----> S (S is mfn of balloon scratch page)
M2P Y -----> X (Y is mfn in persistent kmap entry)
kmap_flush_unused() iterates through all the PTEs in the kmap address
space, using pte_to_page() to obtain the page. If the p2m and the m2p
are inconsistent the incorrect page is returned. This will clear
page->address on the wrong page which may cause subsequent oopses if
that page is currently kmap'ed.
Move the flush back between get_page and __set_phys_to_machine to fix
this.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>