28451 Commits

Author SHA1 Message Date
Erik Hugne
5d21cb70db tipc: allow implicit connect for stream sockets
TIPC's implied connect feature, aka piggyback connect, allows
applications to save one syscall and all SYN/SYN-ACK signalling
overhead when setting up a connection.  Until now, this has only
been supported for SEQPACKET sockets.  Here, we make it possible
to use this feature even with stream sockets.

At the connecting side, the connection is completed when the
first data message arrives from the accepting peer.  This means
that we must allow the connecting user to call blocking recv()
before the socket has reached state SS_CONNECTED.  So we must must
relax the state machine check at recv_stream(), and allow the
recv() call even if socket is in state SS_CONNECTING.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:53:00 -07:00
Ying Xue
cc79dd1ba9 tipc: change socket buffer overflow control to respect sk_rcvbuf
As per feedback from the netdev community, we change the buffer
overflow protection algorithm in receiving sockets so that it
always respects the nominal upper limit set in sk_rcvbuf.

Instead of scaling up from a small sk_rcvbuf value, which leads to
violation of the configured sk_rcvbuf limit, we now calculate the
weighted per-message limit by scaling down from a much bigger value,
still in the same field, according to the importance priority of the
received message.

To allow for administrative tunability of the socket receive buffer
size, we create a tipc_rmem sysctl variable to allow the user to
configure an even bigger value via sysctl command.  It is a size of
three (min/default/max) to be consistent with things like tcp_rmem.

By default, the value initialized in tipc_rmem[1] is equal to the
receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
This value is also set as the default value of sk_rcvbuf.

Originally-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
[Ying: added sysctl variation to Jon's original patch]
Signed-off-by: Ying Xue <ying.xue@windriver.com>
[PG: don't compile sysctl.c if not config'd; add Documentation]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:53:00 -07:00
Eliezer Tamir
dafcc4380d net: add socket option for low latency polling
adds a socket option for low latency polling.
This allows overriding the global sysctl value with a per-socket one.
Unexport sysctl_net_ll_poll since for now it's not needed in modules.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:14 -07:00
Eliezer Tamir
89bf1b5a68 net: remove NET_LL_RX_POLL config menue
Remove NET_LL_RX_POLL from the config menu.
Change default to y.
Busy polling still needs to be enabled at run time.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:14 -07:00
Eliezer Tamir
9a3c71aa80 net: convert low latency sockets to sched_clock()
Use sched_clock() instead of get_cycles().
We can use sched_clock() because we don't care much about accuracy.
Remove the dependency on X86_TSC

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:14 -07:00
Eliezer Tamir
eb6db62282 net: change sysctl_net_ll_poll into an unsigned int
There is no reason for sysctl_net_ll_poll to be an unsigned long.
Change it into an unsigned int.
Fix the proc handler.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:13 -07:00
Daniel Borkmann
2e0c9e7911 net: sctp: sctp_association_init: put refs in reverse order
In case we need to bail out for whatever reason during assoc
init, we call sctp_endpoint_put() and then sock_put(), however,
we've hold both refs in reverse, non-symmetric order, so first
sctp_endpoint_hold() and then sock_hold().

Reverse this, so that in an error case we have sock_put() and then
sctp_endpoint_put(). Actually shouldn't matter too much, since both
cleanup paths do the right thing, but that way, it is more consistent
with the rest of the code.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-14 15:38:36 -07:00
Daniel Borkmann
c164b83814 net: sctp: minor: remove variable in sctp_init_sock
It's only used at this one time, so we could remove it as well.
This is valid and also makes it more explicit/obvious that in case
of error the sp->ep is NULL here, i.e. for the sctp_destroy_sock()
check that was recently added.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-14 15:38:36 -07:00
Daniel Borkmann
405426f6ca net: sctp: sctp_sf_do_prm_asoc: do SCTP_CMD_INIT_CHOOSE_TRANSPORT first
While this currently cannot trigger any NULL pointer dereference in
sctp_seq_dump_local_addrs(), better change the order of commands to
prevent a future bug to happen. Although we first add SCTP_CMD_NEW_ASOC
and then set the SCTP_CMD_INIT_CHOOSE_TRANSPORT, it is okay for now,
since this primitive is only called by sctp_connect() or sctp_sendmsg()
with sctp_assoc_add_peer() set first. However, lets do this precaution
and first set the transport and then add it to the association hashlist
to prevent in future something to possibly triggering this.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-14 15:38:36 -07:00
Daniel Borkmann
f9e42b8535 net: sctp: sideeffect: throw BUG if primary_path is NULL
This clearly states a BUG somewhere in the SCTP code as e.g. fixed once
in f28156335 ("sctp: Use correct sideffect command in duplicate cookie
handling"). If this ever happens, throw a trace in the sideeffect engine
where assocs clearly must have a primary_path assigned.

When in sctp_seq_dump_local_addrs() also throw a WARN and bail out since
we do not need to panic for printing this one asterisk. Also, it will
avoid the not so obvious case when primary != NULL test passes and at a
later point in time triggering a NULL ptr dereference caused by primary.
While at it, also fix up the white space.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-14 15:38:36 -07:00
David S. Miller
09ce069dff Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:

====================
A few miscellaneous improvements and cleanups before the GRE tunnel
integration series. Intended for net-next/3.11.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-14 15:31:22 -07:00
Pravin B Shelar
93d8fd1514 openvswitch: Simplify interface ovs_flow_metadata_from_nlattrs()
This is not functional change, this is just code cleanup.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-06-14 15:09:12 -07:00
Pravin B Shelar
b34df5e805 openvswitch: make skb->csum consistent with rest of networking stack.
Following patch keeps skb->csum correct across ovs.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-06-14 15:09:12 -07:00
Andy Hill
af7841636b openvswitch: Fix misspellings in comments and docs.
Flagged with: https://github.com/lyda/misspell-check
Run with: git ls-files | misspellings -f -

Signed-off-by: Andy Hill <hillad@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-06-14 15:09:11 -07:00
Lorand Jakab
34d94f2102 openvswitch: fix variable names in comment
Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-06-14 15:09:10 -07:00
Pravin B Shelar
91b7514cdf openvswitch: Unify vport error stats handling.
Following patch changes vport->send return type so that vport
layer can do error accounting.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-06-14 15:09:10 -07:00
Jesse Gross
cbd531bebb openvswitch: Remove unused get_config vport op.
The get_config vport op is left over from old compatibility code,
it is neither used nor implemented any more.

Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-06-14 15:09:09 -07:00
Jesse Gross
f44f340883 openvswitch: Immediately exit on error in ovs_vport_cmd_set().
It is an error to try to change the type of a vport using the set
command. However, while we check that this is an error, we still
proceed to allocate memory which then gets freed immediately.
This stops processing after noticing the error, which does not
actually fix a bug but is more correct.

Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-06-14 15:09:09 -07:00
Rony Efraim
1d8faf48c7 net/core: Add VF link state control
Add netlink directives and ndo entry to allow for controling
VF link, which can be in one of three states:

Auto - VF link state reflects the PF link state (default)

Up - VF link state is up, traffic from VF to VF works even if
the actual PF link is down

Down - VF link state is down, no traffic from/to this VF, can be of
use while configuring the VF

Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 17:51:04 -07:00
Eric Dumazet
ca4ec90b31 htb: reorder struct htb_class fields for performance
htb_class structures are big, and source of false sharing on SMP.

By carefully splitting them in two parts, we can improve performance.

I got 9 % performance increase on a 24 threads machine, with 200
concurrent netperf in TCP_RR mode, using a HTB hierarchy of 4 classes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 17:17:02 -07:00
Willem de Bruijn
5f121b9a83 net-rps: fixes for rps flow limit
Caught by sparse:
- __rcu: missing annotation to sd->flow_limit
- __user: direct access in cpumask_scnprintf

Also
- add endline character when printing bitmap if room in buffer
- avoid bucket overflow by reducing FLOW_LIMIT_HISTORY

The last item warrants some explanation. The hashtable buckets are
subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
to bucket size, since all packets may end up in a single bucket. The
current (rather arbitrary) history value of 256 happens to match the
buffer size (u8).

As a result, with a single flow, the first 128 packets are accepted
(correct), the second 128 packets dropped (correct) and then the
history[] array has filled, so that each subsequent new packet
causes an increment in the bucket for new_flow plus a decrement
for old_flow: a steady state.

This is fine if packets are dropped, as the steady state goes away
as soon as a mix of traffic reappears. But, because the 256th packet
overflowed the bucket to 0: no packets are dropped.

Instead of explicitly adding an overflow check, this patch changes
FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
(first item)

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 17:13:05 -07:00
Yuchung Cheng
85f16525a2 tcp: properly send new data in fast recovery in first RTT
Linux sends new unset data during disorder and recovery state if all
(suspected) lost packets have been retransmitted ( RFC5681, section
3.2 step 1 & 2, RFC3517 section 4, NexSeg() Rule 2).  One requirement
is to keep the receive window about twice the estimated sender's
congestion window (tcp_rcv_space_adjust()), assuming the fast
retransmits repair the losses in the next round trip.

But currently it's not the case on the first round trip in either
normal or Fast Open connection, beucase the initial receive window
is identical to (expected) sender's initial congestion window. The
fix is to double it.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:46:29 -07:00
Joe Perches
fe2c6338fd net: Convert uses of typedef ctl_table to struct ctl_table
Reduce the uses of this unnecessary typedef.

Done via perl script:

$ git grep --name-only -w ctl_table net | \
  xargs perl -p -i -e '\
	sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
        s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

Reflow the modified lines that now exceed 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:36:09 -07:00
Flavio Leitner
194f4a6df2 net: make all team port device link events urgent
Since team functionality relies heavily on userspace daemon, we need to
deliver event to userspace via Netlink as quick as possible. So make all
team port device link events urgent.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:31:41 -07:00
Wu Fengguang
a06a2d378d net: ping_check_bind_addr() etc. can be static
net/ipv4/ping.c:286:5: sparse: symbol 'ping_check_bind_addr' was not declared. Should it be static?
net/ipv4/ping.c:355:6: sparse: symbol 'ping_set_saddr' was not declared. Should it be static?
net/ipv4/ping.c:370:6: sparse: symbol 'ping_clear_saddr' was not declared. Should it be static?

net/ipv6/ping.c:60:5: sparse: symbol 'dummy_ipv6_recv_error' was not declared. Should it be static?
net/ipv6/ping.c:64:5: sparse: symbol 'dummy_ip6_datagram_recv_ctl' was not declared. Should it be static?
net/ipv6/ping.c:69:5: sparse: symbol 'dummy_icmpv6_err_convert' was not declared. Should it be static?
net/ipv6/ping.c:73:6: sparse: symbol 'dummy_ipv6_icmp_error' was not declared. Should it be static?
net/ipv6/ping.c:75:5: sparse: symbol 'dummy_ipv6_chk_addr' was not declared. Should it be static?
net/ipv6/ping.c:201:5: sparse: symbol 'ping_v6_seq_show' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:36:41 -07:00
Simon Horman
2c928e0e8d sctp: Correct byte order of access to skb->{network, transport}_header
Corrects an byte order conflict introduced by
158874cac61245b84e939c92c53db7000122b7b0
("sctp: Correct access to skb->{network, transport}_header").
The values in question are host byte order.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:21:50 -07:00
Gao feng
ca15febfe9 netlink: make compare exist all the time
Commit da12c90e099789a63073fc82a19542ce54d4efb9
"netlink: Add compare function for netlink_table"
only set compare at the time we create kernel netlink,
and reset compare to NULL at the time we finially
release netlink socket, but netlink_lookup wants
the compare exist always.

So we should set compare after we allocate nl_table,
and never reset it. make comapre exist all the time.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 00:45:48 -07:00
Eric Dumazet
7c0cadc69c udp: fix two sparse errors
commit ba418fa357a7b3c ("soreuseport: UDP/IPv4 implementation")
added following sparse errors :

net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:433:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:433:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:514:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:514:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:03:24 -07:00
Eric Dumazet
5b9b626377 gro: remove a sparse error
Fix following sparse error :

net/ipv4/af_inet.c:1410:59: warning: restricted __be16 degrades to
integer

added in commit db8caf3dbc77599
("gro: should aggregate frames without DF")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
From: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:03:24 -07:00
David S. Miller
4a2e667ac1 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
This pull request is intended for the 3.11 stream...

One big highlight is the cw1200 driver the ST-E CW1100 & CW1200
WLAN chipsets.  This one has been lingering for a while, lacking
some review comments.  Once started getting pulled into linux-next,
it got a bit more attention and a number of improvements were made
over the initial cut.  No doubt there will be more changes ahead,
but I think it is looking alright at this point.

Along with that, there is the usual flurry of updates to the mac80211
core and the iwlwifi, mwifiex, ath9k, rt2x00, wil6210, and other
drivers.  A few of the highlights are some rt2x00 refactoring/cleanup
by Gabor Juhos, some rt2800 hardware support enhancements by Stanislaw
Gruszka, some iwlwifi power management updates from Alexander Bondar,
some enhanced bcma SPROM support from Rafał Miłecki, and a variety
of other things here and there.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 14:23:41 -07:00
Eric Dumazet
c70eba7453 igmp: fix new sparse errors
Fix following sparse errors :

net/ipv4/igmp.c:1222:25: warning: cast from restricted __be32
net/ipv4/igmp.c🔢31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c🔢31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c🔢31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:1250:31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c:1250:31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c:1250:31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:2380:37: warning: cast from restricted __be32

These were added by commit e9897071350bd9
("igmp: hash a hash table to speedup ip_check_mc_rcu()")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 14:14:55 -07:00
Daniel Borkmann
7a6e288d27 pktgen: ipv6: numa: consolidate skb allocation to pktgen_alloc_skb
We currently allow for numa-node aware skb allocation only within the
fill_packet_ipv4() path, but not in fill_packet_ipv6(). Consolidate that
code to a common allocation helper to enable numa-node aware skb
allocation for ipv6, and use it in both paths. This also makes both
functions a bit more readable.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:47:25 -07:00
Daniel Borkmann
da5bab079f net: udp4: move GSO functions to udp_offload
Similarly to TCP offloading and UDPv6 offloading, move all related
UDPv4 functions to udp_offload.c to make things more explicit. Also,
by this, we can make those functions static.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:47:25 -07:00
Shawn Bohrer
946d3bd723 igmp: remove unnecessary in_device member zeroing
ip_mc_init_dev() is passed a freshly kzalloc'd in_device so it is
unnecessary to explicitly zero out the members.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:41:15 -07:00
Eric Dumazet
e989707135 igmp: hash a hash table to speedup ip_check_mc_rcu()
After IP route cache removal, multicast applications using
a lot of multicast addresses hit a O(N) behavior in ip_check_mc_rcu()

Add a per in_device hash table to get faster lookup.

This hash table is created only if the number of items in mc_list is
above 4.

Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:25:23 -07:00
Eric Dumazet
64153ce0a7 net_sched: htb: do not setup default rate estimators
With a thousand htb classes, est_timer() spends ~5 million cpu cycles
and throws out cpu cache, because each htb class has a default
rate estimator (est 4sec 16sec).

Most users do not use default rate estimators, so switch htb
to not setup ones.

Add a module parameter (htb_rate_est) so that users relying
on this default rate estimator can revert the behavior.

echo 1 >/sys/module/sch_htb/parameters/htb_rate_est

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:14:21 -07:00
Eric Dumazet
130d3d68b5 net_sched: psched_ratecfg_precompute() improvements
Before allowing 64bits bytes rates, refactor
psched_ratecfg_precompute() to get better comments
and increased accuracy.

rate_bps field is renamed to rate_bytes_ps, as we only
have to worry about bytes per second.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 22:39:47 -07:00
John W. Linville
3899ba90a4 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
Conflicts:
	drivers/net/wireless/ath/ath9k/debug.c
	net/mac80211/iface.c
2013-06-11 14:48:32 -04:00
Eric Dumazet
45203a3b38 net_sched: add 64bit rate estimators
struct gnet_stats_rate_est contains u32 fields, so the bytes per second
field can wrap at 34360Mbit.

Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
and switch the kernel to use this structure natively.

This structure is dumped to user space as a new attribute :

TCA_STATS_RATE_EST64

Old tc command will now display the capped bps (to 34360Mbit), instead
of wrapped values, and updated tc command will display correct
information.

Old tc command output, after patch :

eric:~# tc -s -d qd sh dev lo
qdisc pfifo 8001: root refcnt 2 limit 1000p
 Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
 rate 34360Mbit 189696pps backlog 0b 0p requeues 0

This patch carefully reorganizes "struct Qdisc" layout to get optimal
performance on SMP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:51:03 -07:00
Peter Pan(潘卫平)
b41abb42bf net: pass correct parameter to skb_headers_offset_update()
Since commit 1a37e412a022(net: Use 16bits for *_headers fields of struct
skbuff), skb->*_header are relative to skb->head,
so copy_skb_header() should not call skb_headers_offset_update() now,
and we should pass correct parameter to skb_headers_offset_update() in
pskb_expand_head() and skb_copy_expand().

Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:48:06 -07:00
Gao feng
da12c90e09 netlink: Add compare function for netlink_table
As we know, netlink sockets are private resource of
net namespace, they can communicate with each other
only when they in the same net namespace. this works
well until we try to add namespace support for other
subsystems which use netlink.

Don't like ipv4 and route table.., it is not suited to
make these subsytems belong to net namespace, Such as
audit and crypto subsystems,they are more suitable to
user namespace.

So we must have the ability to make the netlink sockets
in same user namespace can communicate with each other.

This patch adds a new function pointer "compare" for
netlink_table, we can decide if the netlink sockets can
communicate with each other through this netlink_table
self-defined compare function.

The behavior isn't changed if we don't provide the compare
function for netlink_table.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:39:42 -07:00
Vlad Yasevich
867a59436f bridge: Add a flag to control unicast packet flood.
Add a flag to control flood of unicast traffic.  By default, flood is
on and the bridge will flood unicast traffic if it doesn't know
the destination.  When the flag is turned off, unicast traffic
without an FDB will not be forwarded to the specified port.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:04:32 -07:00
Vlad Yasevich
9ba18891f7 bridge: Add flag to control mac learning.
Allow user to control whether mac learning is enabled on the port.
By default, mac learning is enabled.  Disabling mac learning will
cause new dynamic FDB entries to not be created for a particular port.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:04:32 -07:00
Cong Wang
30f3a40f9a net: remove last caller of skb_tail_offset() and itself
Similar to the following commits:

commit 00f97da17a0c8d656d0c9 (netpoll: fix position of network header)
commit 525cebedb32a87fa48584 (pktgen: Fix position of ip and udp header)

using skb_tail_offset() seems not correct since the offset
is based on head pointer.

With the last caller removed, skb_tail_offset() can be killed
finally.

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Daniel Borkmann <dborkmann@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 22:22:23 -07:00
Eliezer Tamir
d30e383bb8 tcp: add low latency socket poll support.
Adds low latency socket poll support for TCP.
In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
from the skb to the sk.
In tcp_recvmsg(), when there is no data in the socket we busy-poll.
This is a good example of how to add busy-poll support to more protocols.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:36 -07:00
Eliezer Tamir
a5b50476f7 udp: add low latency socket poll support
Add upport for busy-polling on UDP sockets.
In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
from the skb into the sk.
This is done at the earliest possible moment, right after we identify
which socket this skb is for.
In __skb_recv_datagram When there is no data and the user
tries to read we busy poll.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:36 -07:00
Eliezer Tamir
0602129286 net: add low latency socket poll
Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:35 -07:00
Eliezer Tamir
af12fa6e46 net: add napi_id and hash
Adds a napi_id and a hashing mechanism to lookup a napi by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:35 -07:00
Pablo Neira Ayuso
c05cdb1b86 netlink: allow large data transfers from user-space
I can hit ENOBUFS in the sendmsg() path with a large batch that is
composed of many netlink messages. Here that limit is 8 MBytes of
skbuff data area as kmalloc does not manage to get more than that.

While discussing atomic rule-set for nftables with Patrick McHardy,
we decided to put all rule-set updates that need to be applied
atomically in one single batch to simplify the existing approach.
However, as explained above, the existing netlink code limits us
to a maximum of ~20000 rules that fit in one single batch without
hitting ENOBUFS. iptables does not have such limitation as it is
using vmalloc.

This patch adds netlink_alloc_large_skb() which is only used in
the netlink_sendmsg() path. It uses alloc_skb if the memory
requested is <= one memory page, that should be the common case
for most subsystems, else vmalloc for higher memory allocations.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 16:26:34 -07:00
Daniel Borkmann
28850dc7c7 net: tcp: move GRO/GSO functions to tcp_offload
Would be good to make things explicit and move those functions to
a new file called tcp_offload.c, thus make this similar to tcpv6_offload.c.
While moving all related functions into tcp_offload.c, we can also
make some of them static, since they are only used there. Also, add
an explicit registration function.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 14:39:05 -07:00