42497 Commits

Author SHA1 Message Date
Sowmini Varadhan
ef9e62c2e5 RDS: recv path gets the conn_path from rds_incoming for MP capable transports
Transports that are t_mp_capable should set the rds_conn_path
on which the datagram was recived in the ->i_conn_path field
of struct rds_incoming.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:42 -07:00
Sowmini Varadhan
7e8f4413d7 RDS: add t_mp_capable bit to be set by MP capable transports
The t_mp_capable bit will be used in the core rds module
to support multipathing logic when the transport supports it.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:41 -07:00
Sowmini Varadhan
0cb43965d4 RDS: split out connection specific state from rds_connection to rds_conn_path
In preparation for multipath RDS, split the rds_connection
structure into a base structure, and a per-path struct rds_conn_path.
The base structure tracks information and locks common to all
paths. The workqs for send/recv/shutdown etc are tracked per
rds_conn_path. Thus the workq callbacks now work with rds_conn_path.

This commit allows for one rds_conn_path per rds_connection, and will
be extended into multiple conn_paths in  subsequent commits.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:41 -07:00
Neal Cardwell
dcf1158b27 tcp: return sizeof tcp_dctcp_info in dctcp_get_info()
Make sure that dctcp_get_info() returns only the size of the
info->dctcp struct that it zeroes out and fills in. Previously it had
been returning the size of the enclosing tcp_cc_info union,
sizeof(*info).  There is no problem yet, but that union that may one
day be larger than struct tcp_dctcp_info, in which case the
TCP_CC_INFO code might accidentally copy uninitialized bytes from the
stack.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:46:30 -07:00
Wei Yongjun
a5e27d18fe sctp: fix error return code in sctp_init()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:45:42 -07:00
David S. Miller
d4c76c1afe RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAV16wsPSw1s6N8H32AQLGqA/+MC6/oufeV5WF/csMucx6I5wauPDQND0b
 E1bOgzceGckwY3NUh14sYo8o/JVxzkKCWrWb0bNBSdqUup6vTMxCBgkSUxvaCbN4
 RIcqu8LDB3vCrzMobMWijZDGqCpF8ZprUcvfCVj1tdrnEbK3njoYJjJ9UZN/lVPw
 Tw+ojKpAty7r/JtZUkWEBhlZZtG26Ko7t+mDbgISOGulPNnw0HSsodMCDb1iSMRS
 MZESOqQ5Kb8vhNOQ0zn+fNWB5d94yLZIxmt/siDXuetopHWuD1u8imqz+lhQ1BTU
 7tOO1PKGxnsalw7FlgNL1cNmOJi/kEKISURG82K1u3pTk2Bf9vtiEPpLnC6p0Qy2
 WwIQVJWewSCYsLiujqhXwmTGHC1m53979VtETHZoKhj/Jm8t3S38XN92Kt/P9y9g
 z1r3EBs6+WzTt4zaw5rjVBY33reeDgEo2RYusLxMOgvPsmX5KMNvpov9R04+JthC
 Qp1zWsE2FerxRJ+CNFBNl0ei4NVMsds3OHuEfgAiEedRowjGokDRQd6stVEev05S
 S2XggUvt011Fvhud0UJWH9koeVcHA85KNCK8V1aj9xH3zsQDizMYhZMhaL34wnu+
 z9XSv6DjZzJVWxC4gyQ/bsvdg34Uerv6g2BwRgqDCfh90Sa5U2n1CL8b3LwTKFU3
 3R3s0l3R5gg=
 =FXID
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160613' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Rename rxrpc source files

Here's the next part of the AF_RXRPC rewrite.  In this set I rename some of
the files in the net/rxrpc/ directory and adjust the Makefile and
ar-internal.h to reflect the changes.

The aim is twofold:

 (1) Remove the "ar-" prefix on those files that have it as it's not really
     useful, especially now that I'm building rxkad in.

 (2) To aid splitting the local, peer, connection and call handling code
     into separate files for object and event handling in future patches by
     making it easier to come up with new filenames.

There are two commits:

 (1) The first commit does a bunch of renames of .c files and alters the
     Makefile.  ar-internal.h isn't renamed at this time to avoid having to
     change the contents of the files being renamed.

 (2) The second commit changes the section label comments in ar-internal.h
     to reflect the changed filenames and reorders the file so that the
     sections are back in filename order.

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160613
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:30:32 -07:00
Amir Vadai
e8eb36cd8c net/sched: flower: Return error when hw can't offload and skip_sw is set
When skip_sw is set and hardware fails to apply filter, return error to
user. This will make error propagation logic similar to the one
currently used in u32 classifier.
Also, changed code to use tc_skip_sw() utility function.

Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 22:37:26 -07:00
Nicolas Dichtel
da6f1da819 ovs/gre: fix rtnl notifications on iface deletion
The function gretap_fb_dev_create() (only used by ovs) never calls
rtnl_configure_link(). The consequence is that dev->rtnl_link_state is
never set to RTNL_LINK_INITIALIZED.
During the deletion phase, the function rollback_registered_many() sends
a RTM_DELLINK only if dev->rtnl_link_state is set to RTNL_LINK_INITIALIZED.

Fixes: b2acd1dc3949 ("openvswitch: Use regular GRE net_device instead of vport")
CC: Thomas Graf <tgraf@suug.ch>
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 22:21:44 -07:00
Nicolas Dichtel
106da663ff ovs/gre,geneve: fix error path when creating an iface
After ipgre_newlink()/geneve_configure() call, the netdev is registered.

Fixes: 7e059158d57b ("vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices")
CC: David Wragg <david@weave.works>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 22:21:44 -07:00
Su, Xuemin
d1e37288c9 udp reuseport: fix packet of same flow hashed to different socket
There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:

1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().

2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.

That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.

This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
2) From the same host send udp packets to 127.0.0.1:40000, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aformentioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:40000, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:40000.

It's the same case for IPv6, and this patch also fixes that.

Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 17:23:09 -04:00
Eric Dumazet
6c0d54f189 net_sched: fix pfifo_head_drop behavior vs backlog
When the qdisc is full, we drop a packet at the head of the queue,
queue the current skb and return NET_XMIT_CN

Now we track backlog on upper qdiscs, we need to call
qdisc_tree_reduce_backlog(), even if the qlen did not change.

Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 17:17:58 -04:00
Hannes Frederic Sowa
b46d9f625b ipv4: fix checksum annotation in udp4_csum_init
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e1e098fa ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 15:28:04 -04:00
Hannes Frederic Sowa
c148d16369 ipv6: fix checksum annotation in udp6_csum_init
Cc: Tom Herbert <tom@herbertland.com>
Fixes: 4068579e1e098fa ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 15:26:42 -04:00
Hannes Frederic Sowa
5119bd1681 ipv6: tcp: fix endianness annotation in tcp_v6_send_response
Cc: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 15:25:35 -04:00
Hannes Frederic Sowa
dcb94b88c0 ipv6: fix endianness error in icmpv6_err
IPv6 ping socket error handler doesn't correctly convert the new 32 bit
mtu to host endianness before using.

Cc: Lorenzo Colitti <lorenzo@google.com>
Fixes: 6d0bfe22611602f ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 15:24:35 -04:00
David Howells
0d81a51ab9 rxrpc: Update the comments in ar-internal.h to reflect renames
Update the section comments in ar-internal.h that indicate the locations of
the referenced items to reflect the renames done to the .c files in
net/rxrpc/.

This also involves some rearrangement to reflect keep the sections in order
of filename.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-13 13:38:51 +01:00
David Howells
8c3e34a4ff rxrpc: Rename files matching ar-*.c to git rid of the "ar-" prefix
Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix.
This will aid splitting those files by making easier to come up with new
names.

Note that the not all files are simply renamed from ar-X.c to X.c.  The
following exceptions are made:

 (*) ar-call.c -> call_object.c
     ar-ack.c -> call_event.c

     call_object.c is going to contain the core of the call object
     handling.  Call event handling is all going to be in call_event.c.

 (*) ar-accept.c -> call_accept.c

     Incoming call handling is going to be here.

 (*) ar-connection.c -> conn_object.c
     ar-connevent.c -> conn_event.c

     The former file is going to have the basic connection object handling,
     but there will likely be some differentiation between client
     connections and service connections in additional files later.  The
     latter file will have all the connection-level event handling.

 (*) ar-local.c -> local_object.c

     This will have the local endpoint object handling code.  The local
     endpoint event handling code will later be split out into
     local_event.c.

 (*) ar-peer.c -> peer_object.c

     This will have the peer endpoint object handling code.  Peer event
     handling code will be placed in peer_event.c (for the moment, there is
     none).

 (*) ar-error.c -> peer_event.c

     This will become the peer event handling code, though for the moment
     it's actually driven from the local endpoint's perspective.

Note that I haven't renamed ar-transport.c to transport_object.c as the
intention is to delete it when the rxrpc_transport struct is excised.

The only file that actually has its contents changed is net/rxrpc/Makefile.

net/rxrpc/ar-internal.h will need its section marker comments updating, but
I'll do that in a separate patch to make it easier for git to follow the
history across the rename.  I may also want to rename ar-internal.h at some
point - but that would mean updating all the #includes and I'd rather do
that in a separate step.

Signed-off-by: David Howells <dhowells@redhat.com.
2016-06-13 12:16:05 +01:00
Florian Westphal
99860208bc sched: remove NET_XMIT_POLICED
sch_atm returns this when TC_ACT_SHOT classification occurs.

But all other schedulers that use tc_classify
(htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
in this case so just do that in atm.

BATMAN uses it as an intermediate return value to signal
forwarding vs. buffering, but it did not return POLICED to
callers outside of BATMAN.

Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-12 22:02:11 -04:00
Eric Dumazet
cbdf451164 net_sched: prio: properly report out of memory errors
At Qdisc creation or change time, prio_tune() creates missing
pfifo qdiscs but does not return an error code if one
qdisc could not be allocated.

Leaving a qdisc in non operational state without telling user
anything about this problem is not good.

Also, testing if we replace something different than noop_qdisc
a second time makes no sense so I removed useless code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-12 21:56:38 -04:00
Miklos Szeredi
30402c8949 Merge branch 'overlayfs-af_unix-fix' into overlayfs-linus 2016-06-12 12:05:21 +02:00
David S. Miller
86ef7f9cbf ipconfig: Protect ic_addrservaddr with IPCONFIG_DYNAMIC.
>> net/ipv4/ipconfig.c:130:15: warning: 'ic_addrservaddr' defined but not used [-Wunused-variable]
    static __be32 ic_addrservaddr = NONE; /* IP Address of the IP addresses'server */

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-11 20:40:24 -07:00
Hannes Frederic Sowa
38b7097b55 ipv6: use TOS marks from sockets for routing decision
In IPv6 the ToS values are part of the flowlabel in flowi6 and get
extracted during fib rule lookup, but we forgot to correctly initialize
the flowlabel before the routing lookup.

Reported-by: <liam.mcbirnie@boeing.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-11 15:33:26 -07:00
Eric Dumazet
45f50bed1d net_sched: remove generic throttled management
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.

I already removed it for sch_fq in commit f2600cf02b5b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.

When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.

Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.

This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:21 -07:00
Eric Dumazet
42117927ca net_sched: netem: remove qdisc_is_throttled() use
Looks like it is only there as some optimization attempt.

Since __QDISC_STATE_THROTTLED set/unset is way too expensive,
and netem is the last user, just remove this check.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:21 -07:00
Eric Dumazet
cca605dd4b net_sched: cbq: remove a flaky use of qdisc_is_throttled()
So far no qdisc ever unset the throttled bit at enqueue() time,
so CBQ usage of qdisc_is_throttled() was flaky.

Since __QDISC_STATE_THROTTLED set/unset is way too expensive
considering that only CBQ was eventually caring for this status,
it would make sense to implement a Qdisc ops ->is_throttled()
if we find that this is needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:20 -07:00
Eric Dumazet
8fe6a79fb8 net_sched: sch_plug: use a private throttled status
We want to get rid of generic qdisc throttled management,
so this qdisc has to use a private flag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:20 -07:00
Ben Dooks
0b392be9a8 net: ipconfig: avoid warning by making ic_addrservaddr static
The symbol ic_addrservaddr is not static, but has no declaration
to match so make it static to fix the following warning:

net/ipv4/ipconfig.c:130:8: warning: symbol 'ic_addrservaddr' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:28:15 -07:00
Ben Dooks
c3ec5e5ce9 net: diag: add missing declarations
The functions inet_diag_msg_common_fill and inet_diag_msg_attrs_fill
seem to have been missed from the include/linux/inet_diag.h header
file. Add them to fix the following warnings:

net/ipv4/inet_diag.c:69:6: warning: symbol 'inet_diag_msg_common_fill' was not declared. Should it be static?
net/ipv4/inet_diag.c:108:5: warning: symbol 'inet_diag_msg_attrs_fill' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:22:55 -07:00
Xin Long
d46e416c11 sctp: sctp should change socket state when shutdown is received
Now sctp doesn't change socket state upon shutdown reception. It changes
just the assoc state, even though it's a TCP-style socket.

For some cases, if we really need to check sk->sk_state, it's necessary to
fix this issue, at least when we use ss or netstat to dump, we can get a
more exact information.

As an improvement, we will change sk->sk_state when we change asoc->state
to SHUTDOWN_RECEIVED, and also do it in sctp_shutdown to keep consistent
with sctp_close.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:21:23 -07:00
David S. Miller
d6cf3a85b4 For the next cycle, we have the following:
* the biggest change is Michał's work on integrating FQ/codel
    with the mac80211 internal software queues
  * cfg80211 connect result gets clarified for the
    "no connection at all" case
  * advertisement of per-interface type capabilities, in case
    they differ (which makes a lot of sense for some capabilities)
  * most of the nl80211 & hwsim unprivileged namespace operation
    changes
  * human-readable VHT capabilities in debugfs
  * some other cleanups, like spelling
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXWWe/AAoJEGt7eEactAAdXVsQAKGkdXGmUU14tfiRCnZryMEH
 GyiVDHQivfcadicbK9599LXadvYc6ATZHfYoSnROwB82NfhKB71N8dUJ4qePxLNa
 lDe5uuuAgxtHG63hK1R02flPorkBAEcUcdHsFSuOw7JWag4/49sCqWH8X5K8E9aT
 vrhaPeppJPydwnmNzNOBvMsLEJFJdZWaEZupZQ0kiJ/jB/howFdzF75GxGQ2jh+F
 dPk22/PtV2igCtntqaty7h057AdZ+znQuiUdVB7eYIOle7veeGyMzFFf0xQ99LAZ
 +xcA7GA74u6m8O6SUVw+6nhrUJ5XTsKGUtmKCTVOcUGa5z7XDD7NIxk2SgloCJkC
 qPqVU8wyBCxKc+6JsyiVSLcB5MWvWxifvo0OyLsbCqhN50bTtJT96ymKOAx4NeSG
 s+TYlV2HgVRNN6PzGvl/ZUo8Mm1UFGWlCBPcyMy9Fwc0jxdhnOAZzOtqV1yVwlCz
 M43RKwxBX6MHAtlfwy6g3M5ievAviwY3Kt+yGeLRIVsJJfOUdQxYAa+m7GTeohD/
 Tns4IHbtWJiLTKuaTvrs4ec3ycXrv7iibMINcarvkTgbbF9Qvarf/e2RoWO/freY
 OUixDdG7HmsY0XIUhSepeMJpn5xIhQ7dkmckhzDEgZqBM6uPJEMB2+O/CGNDsxEp
 VjfnwXJ+9PQrjheBvnD2
 =FvhQ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
For the next cycle, we have the following:
 * the biggest change is Michał's work on integrating FQ/codel
   with the mac80211 internal software queues
 * cfg80211 connect result gets clarified for the
   "no connection at all" case
 * advertisement of per-interface type capabilities, in case
   they differ (which makes a lot of sense for some capabilities)
 * most of the nl80211 & hwsim unprivileged namespace operation
   changes
 * human-readable VHT capabilities in debugfs
 * some other cleanups, like spelling
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:13:32 -07:00
Lawrence Brakmo
699fafafab tcp: add NV congestion control
TCP-NV (New Vegas) is a major update to TCP-Vegas.
An earlier version of NV was presented at 2010's LPC.
It is a delayed based congestion avoidance for the
data center. This version has been tested within a
10G rack where the HW RTTs are 20-50us and with
1 to 400 flows.

A description of TCP-NV, including implementation
details as well as experimental results, can be found at:
http://www.brakmo.org/networking/tcp-nv/TCPNV.html

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:07:49 -07:00
Lawrence Brakmo
6f094b9ec6 tcp: add in_flight to tcp_skb_cb
Add in_flight (bytes in flight when packet was sent) field
to tx component of tcp_skb_cb and make it available to
congestion modules' pkts_acked() function through the
ack_sample function argument.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:07:49 -07:00
Mike Rapoport
1276f24eee packet: use common code for virtio_net_hdr and skb GSO conversion
Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_from_skb

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:03:56 -07:00
Bhaktipriya Shridhar
231edca97f RDS: IB: Remove deprecated create_workqueue
alloc_workqueue replaces deprecated create_workqueue().

Since the driver is infiniband which can be used as block device and the
workqueue seems involved in regular operation of the device, so a
dedicated workqueue has been used  with WQ_MEM_RECLAIM set to guarantee
forward progress under memory pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 22:52:28 -07:00
Ido Schimmel
56fae404fb bridge: Fix incorrect re-injection of STP packets
Commit 8626c56c8279 ("bridge: fix potential use-after-free when hook
returns QUEUE or STOLEN verdict") fixed incorrect usage of NF_HOOK's
return value by consuming packets in okfn via br_pass_frame_up().

However, this function re-injects packets to the Rx path with skb->dev
set to the bridge device, which breaks kernel's STP, as all STP packets
appear to originate from the bridge device itself.

Instead, if STP is enabled and bridge isn't a 802.1ad bridge, then learn
packet's SMAC and inject it back to the Rx path for further processing
by the packet handlers.

The patch also makes netfilter's behavior consistent with regards to
packets destined to the Bridge Group Address, as no hook registered at
LOCAL_IN will ever be called, regardless if STP is enabled or not.

Cc: Florian Westphal <fw@strlen.de>
Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Fixes: 8626c56c8279 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 22:41:58 -07:00
David Howells
0e119b41b7 rxrpc: Limit the listening backlog
Limit the socket incoming call backlog queue size so that a remote client
can't pump in sufficient new calls that the server runs out of memory.  Note
that this is partially theoretical at the moment since whilst the number of
calls is limited, the number of packets trying to set up new calls is not.
This will be addressed in a later patch.

If the caller of listen() specifies a backlog INT_MAX, then they get the
current maximum; anything else greater than max_backlog or anything
negative incurs EINVAL.

The limit on the maximum queue size can be set by:

	echo N >/proc/sys/net/rxrpc/max_backlog

where 4<=N<=32.

Further, set the default backlog to 0, requiring listen() to be called
before we start actually queueing new calls.  Whilst this kind of is a
change in the UAPI, the caller can't actually *accept* new calls anyway
unless they've first called listen() to put the socket into the LISTENING
state - thus the aforementioned new calls would otherwise just sit there,
eating up kernel memory.  (Note that sockets that don't have a non-zero
service ID bound don't get incoming calls anyway.)

Given that the default backlog is now 0, make the AFS filesystem call
kernel_listen() to set the maximum backlog for itself.

Possible improvements include:

 (1) Trimming a too-large backlog to max_backlog when listen is called.

 (2) Trimming the backlog value whenever the value is used so that changes
     to max_backlog are applied to an open socket automatically.  Note that
     the AFS filesystem opens one socket and keeps it open for extended
     periods, so would miss out on changes to max_backlog.

 (3) Having a separate setting for the AFS filesystem.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 18:14:47 -07:00
David Howells
bc6e1ea32c rxrpc: Trim line-terminal whitespace
Trim line-terminal whitespace in net/rxrpc/

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 18:14:47 -07:00
Daniel Borkmann
ea7f8277f9 net, cls: allow for deleting all filters for given parent
Add a possibility where the user can just specify the parent and
all filters under that parent are then being purged. Currently,
for example for scripting, one needs to specify pref/prio to have
a well-defined number for 'tc filter del' command for addressing
the previously created instance or additionally filter handle in
case of priorities being the same. Improve usage by allowing the
option for tc to specify the parent and removing the whole chain
for that given parent.

Example usage after patch, no tc changes required:

  # tc qdisc replace dev foo clsact
  # tc filter add dev foo egress bpf da obj ./bpf.o
  # tc filter add dev foo egress bpf da obj ./bpf.o
  # tc filter show dev foo egress
  filter protocol all pref 49151 bpf
  filter protocol all pref 49151 bpf handle 0x1 bpf.o:[classifier] direct-action
  filter protocol all pref 49152 bpf
  filter protocol all pref 49152 bpf handle 0x1 bpf.o:[classifier] direct-action
  # tc filter del dev foo egress
  # tc filter show dev foo egress
  #

Previously, RTM_DELTFILTER requests with invalid prio of 0 were
rejected, so only netlink requests with RTM_NEWTFILTER and NLM_F_CREATE
flag were allowed where the kernel would auto-generate a pref/prio.
We can piggyback on that and use prio of 0 as a wildcard for
requests of RTM_DELTFILTER.

For notifying tc netlink monitoring users (e.g. libnl uses this
for caching), there are two options, that is, sending individual
tfilter_notify() notifications for each tcf_proto, or sending a
single one indicating wildcard removal. I tried both and there
are pros and cons for each, eventually I decided for sending
individual tfilter_notify(), so that user space can support this
seamlessly and there won't be a mess of changing each and every
application to make sure expectations from the kernel won't break
when they don't understand single notification. Since linear chains
don't really scale, I expect only a handful of classifiers to be
attached at max for a given parent anyway.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 18:11:01 -07:00
Daniel Borkmann
f7bd9e36ee bpf: reject wrong sized filters earlier
Add a bpf_check_basics_ok() and reject filters that are of invalid
size much earlier, so we don't do any useless work such as invoking
bpf_prog_alloc(). Currently, rejection happens in bpf_check_classic()
only, but it's really unnecessarily late and they should be rejected
at earliest point. While at it, also clean up one bpf_prog_size() to
make it consistent with the remaining invocations.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 18:00:57 -07:00
Daniel Borkmann
a70b506efe bpf: enforce recursion limit on redirects
Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
Currently, they are not handeled by the limiter when attached to clsact's
egress parent, for example, and a buggy program redirecting it to the
same device again could run into stack overflow eventually. It would be
good if we could notify an admin to give him a chance to react. We reuse
xmit_recursion instead of having one private to eBPF, so that the stack's
current recursion depth will be taken into account as well. Follow-up to
commit 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") and
27b29f63058d ("bpf: add bpf_redirect() helper").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 18:00:57 -07:00
William Tu
f2a4d086ed openvswitch: Add packet truncation support.
The patch adds a new OVS action, OVS_ACTION_ATTR_TRUNC, in order to
truncate packets. A 'max_len' is added for setting up the maximum
packet size, and a 'cutlen' field is to record the number of bytes
to trim the packet when the packet is outputting to a port, or when
the packet is sent to userspace.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 17:58:03 -07:00
David S. Miller
1578b0a5e9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/sched/act_police.c
	net/sched/sch_drr.c
	net/sched/sch_hfsc.c
	net/sched/sch_prio.c
	net/sched/sch_red.c
	net/sched/sch_tbf.c

In net-next the drop methods of the packet schedulers got removed, so
the bug fixes to them in 'net' are irrelevant.

A packet action unload crash fix conflicts with the addition of the
new firstuse timestamp.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 11:52:24 -07:00
Linus Torvalds
698ea54dde Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) nfnetlink timestamp taken from wrong skb, fix from Florian Westphal.

 2) Revert some msleep conversions in rtlwifi as these spots are in
    atomic context, from Larry Finger.

 3) Validate that NFTA_SET_TABLE attribute is actually specified when we
    call nf_tables_getset().  From Phil Turnbull.

 4) Don't do mdio_reset in stmmac driver with spinlock held as that can
    sleep, from Vincent Palatin.

 5) sk_filter() does things other than run a BPF filter, so we should
    not elide it's call just because sk->sk_filter is NULL.  Fix from
    Eric Dumazet.

 6) Fix missing backlog updates in several packet schedulers, from Cong
    Wang.

 7) bnx2x driver should allow VLAN add/remove while the interface is
    down, from Michal Schmidt.

 8) Several RDS/TCP race fixes from Sowmini Varadhan.

 9) fq_codel scheduler doesn't return correct queue length in dumps,
    from Eric Dumazet.

10) Fix TCP stats for tail loss probe and early retransmit in ipv6, from
    Yuchung Cheng.

11) Properly initialize udp_tunnel_socket_cfg in l2tp_tunnel_create(),
    from Guillaume Nault.

12) qfq scheduler leaks SKBs if a kzalloc fails, fix from Florian
    Westphal.

13) sock_fprog passed into PACKET_FANOUT_DATA needs compat handling,
    from Willem de Bruijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
  vmxnet3: segCnt can be 1 for LRO packets
  packet: compat support for sock_fprog
  stmmac: fix parameter to dwmac4_set_umac_addr()
  net/mlx5e: Fix blue flame quota logic
  net/mlx5e: Use ndo_stop explicitly at shutdown flow
  net/mlx5: E-Switch, always set mc_promisc for allmulti vports
  net/mlx5: E-Switch, Modify node guid on vf set MAC
  net/mlx5: E-Switch, Fix vport enable flow
  net/mlx5: E-Switch, Use the correct error check on returned pointers
  net/mlx5: E-Switch, Use the correct free() function
  net/mlx5: Fix E-Switch flow steering capabilities check
  net/mlx5: Fix flow steering NIC capabilities check
  net/mlx5: Fix root flow table update
  net/mlx5: Fix MLX5_CMD_OP_MAX to be defined correctly
  net/mlx5: Fix masking of reserved bits in XRCD number
  net/mlx5: Fix the size of modify QP mailbox
  mlxsw: spectrum: Don't sleep during ndo_get_phys_port_name()
  mlxsw: spectrum: Make split flow match firmware requirements
  wext: Fix 32 bit iwpriv compatibility issue with 64 bit Kernel
  cfg80211: remove get/set antenna and tx power warnings
  ...
2016-06-10 08:32:24 -07:00
Willem de Bruijn
719c44d340 packet: compat support for sock_fprog
Socket option PACKET_FANOUT_DATA takes a struct sock_fprog as argument
if PACKET_FANOUT has mode PACKET_FANOUT_CBPF. This structure contains
a pointer into user memory. If userland is 32-bit and kernel is 64-bit
the two disagree about the layout of struct sock_fprog.

Add compat setsockopt support to convert a 32-bit compat_sock_fprog to
a 64-bit sock_fprog. This is analogous to compat_sock_fprog support for
SO_REUSEPORT added in commit 1957598840f4 ("soreuseport: add compat
case for setsockopt SO_ATTACH_REUSEPORT_CBPF").

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 23:41:03 -07:00
David Ahern
e434863718 net: vrf: Fix crash when IPv6 is disabled at boot time
Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is
disabled at boot using the kernel option ipv6.disable=1. Using
current net-next with the boot option:

$ ip link add red type vrf table 1001

Generates:
[12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748
[12210.921341] IP: [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
[12210.922537] PGD b79e3067 PUD bb32b067 PMD 0
[12210.923479] Oops: 0000 [#1] SMP
[12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc
[12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235
[12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000
[12210.929328] RIP: 0010:[<ffffffff814b30e3>]  [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a
[12210.930697] RSP: 0018:ffff8800bacaf888  EFLAGS: 00010202
[12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28
[12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286
[12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b
[12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9
[12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000
[12210.937103] FS:  00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000
[12210.938321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0
[12210.940278] Stack:
[12210.940603]  ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135
[12210.941818]  ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800
[12210.943040]  ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280
[12210.944288] Call Trace:
[12210.944688]  [<ffffffff814b3135>] fib6_new_table+0x24/0x8a
[12210.945516]  [<ffffffff81397c88>] vrf_dev_init+0xd4/0x162
[12210.946328]  [<ffffffff814091e1>] register_netdevice+0x100/0x396
[12210.947209]  [<ffffffff8139823d>] vrf_newlink+0x40/0xb3
[12210.948001]  [<ffffffff814187f0>] rtnl_newlink+0x5d3/0x6d5
...

The problem above is due to the fact that the fib hash table is not
allocated when IPv6 is disabled at boot.

As for the VRF driver it should not do any IPv6 initializations if IPv6
is disabled, so it needs to know if IPv6 is disabled at boot. The disable
parameter is private to the IPv6 module, so provide an accessor for
modules to determine if IPv6 was disabled at boot time.

Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 23:34:42 -07:00
David Howells
2341e07757 rxrpc: Simplify connect() implementation and simplify sendmsg() op
Simplify the RxRPC connect() implementation.  It will just note the
destination address it is given, and if a sendmsg() comes along with no
address, this will be assigned as the address.  No transport struct will be
held internally, which will allow us to remove this later.

Simplify sendmsg() also.  Whilst a call is active, userspace refers to it
by a private unique user ID specified in a control message.  When sendmsg()
sees a user ID that doesn't map to an extant call, it creates a new call
for that user ID and attempts to add it.  If, when we try to add it, the
user ID is now registered, we now reject the message with -EEXIST.  We
should never see this situation unless two threads are racing, trying to
create a call with the same ID - which would be an error.

It also isn't required to provide sendmsg() with an address - provided the
control message data holds a user ID that maps to a currently active call.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 23:30:12 -07:00
Fabien Siron
21aff3b905 net/netlink/af_netlink.h: Remove unused structure.
Signed-off-by: Fabien Siron <fabien.siron@epita.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 22:26:24 -07:00
Eric Dumazet
d3fff6c443 net: add netdev_lockdep_set_classes() helper
It is time to add netdev_lockdep_set_classes() helper
so that lockdep annotations per device type are easier to manage.

This removes a lot of copies and missing annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 13:28:37 -07:00
Eric Dumazet
52fbb29079 net: sched: fix qdisc->running lockdep annotations
1) qdisc_run_begin() is really using the equivalent of a trylock.
  Instead of using write_seqcount_begin(), use a combination of
  raw_write_seqcount_begin() and correct lockdep annotation.

2) sch_direct_xmit() should use regular spin_lock(root_lock)

Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 13:28:37 -07:00
David S. Miller
e71ba91e48 Two more fixes for now:
* a fix for a long-standing iwpriv 32/64 compat issue
  * two fairly recently introduced (4.6) warning asking for
    symmetric operations are erroneous and I remove them
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXWWVNAAoJEGt7eEactAAd2HMP/RGyAWN/uXJVNySsqvBSqiew
 t5ZTopRfL2YEcO1EUF749bncfJo6P9KUtuM72gYYRYART5viucdYFk53Ky8EtKh8
 QQ7WTGyiuQP7XvderfXmrjV/J35KFiPTnf83KUfOnBTVOCtw6doIBIMutUmR7FPV
 cyH4e56j9agzbBJ7mjeLYGBMmyhq2uZ85vXqk/mnpn5fH9+lGojcoIVZfhXWmMoI
 d5O/giARzjDM7LTMIwgepY5w/j8RX1rMJCXfxiZsGMuiSTYH8FETbe3Go9CL/qHS
 8cZCCQRMCpjMK+SOmNr7xYx70YjIJQabwo9luYmcdCkhw6VHRxHl47lk/2XLp4F9
 G+ReA1A7izV2XSyJIWTl+Or1qzgzsf6SfZHKszSdY47NXzK5aCprRfUvXyxLv6GZ
 IGcHzNWOFF/1fKhh3w2NylhTd/omkvtZzt5Xuom94OPXdyxzy/PmViZ+xuhutrGt
 dYxhmzPZcw+/OeTGgbt2YfYFUdMZqbHkaHPE0wia03r1yXzQD/x05/QET2vwIA0L
 C+ATyaykSCCOm/ovwm5rada151Iai1engCWIVw/pZ3HUj6C8xqwyDS3+xsTRmKzI
 oBp1O3K3yR0c0T9KJNmORnnrYQAbSBIQNhdioXVCBpByTOjYfATKVQtaSIV/uu4/
 2r1KQmrT4z5OrL6lm2wv
 =0I3T
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two more fixes for now:
 * a fix for a long-standing iwpriv 32/64 compat issue
 * two fairly recently introduced (4.6) warning asking for
   symmetric operations are erroneous and I remove them
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 11:52:47 -07:00