3398 Commits

Author SHA1 Message Date
Nicolas Dichtel
e837735ec4 ip6_tunnel: ensure to always have a link local address
When an Xin6 tunnel is set up, we check other netdevices to inherit the link-
local address. If none is available, the interface will not have any link-local
address. RFC4862 expects that each interface has a link local address.

Now than this kind of tunnels supports x-netns, it's easy to fall in this case
(by creating the tunnel in a netns where ethernet interfaces stand and then
moving it to a other netns where no ethernet interface is available).

RFC4291, Appendix A suggests two methods: the first is the one currently
implemented, the second is to generate a unique identifier, so that we can
always generate the link-local address. Let's use eth_random_addr() to generate
this interface indentifier.

I remove completly the previous method, hence for the whole life of the
interface, the link-local address remains the same (previously, it depends on
which ethernet interfaces were up when the tunnel interface was set up).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 23:45:42 -07:00
David S. Miller
7eaa48a45c Revert "ipv6: fix checkpatch errors in net/ipv6/addrconf.c"
This reverts commit df8372ca747f6da9e8590775721d9363c1dfc87e.

These changes are buggy and make unintended semantic changes
to ip6_tnl_add_linklocal().

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 23:44:39 -07:00
dingtianhong
df8372ca74 ipv6: fix checkpatch errors in net/ipv6/addrconf.c
ERROR: code indent should use tabs where possible: fix 2.
ERROR: do not use assignment in if condition: fix 5.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 15:05:03 -07:00
dingtianhong
ba3542e15c ipv6: convert the uses of ADBG and remove the superfluous parentheses
Just follow the Joe Perches's opinion, it is a better way to fix the
style errors.

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 15:05:03 -07:00
David S. Miller
89d5e23210 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Conflicts:
	net/netfilter/nf_conntrack_proto_tcp.c

The conflict had to do with overlapping changes dealing with
fixing the use of an "s32" to hold the value returned by
NAT_OFFSET().

Pablo Neira Ayuso says:

====================
The following batch contains Netfilter/IPVS updates for your net-next tree.
More specifically, they are:

* Trivial typo fix in xt_addrtype, from Phil Oester.

* Remove net_ratelimit in the conntrack logging for consistency with other
  logging subsystem, from Patrick McHardy.

* Remove unneeded includes from the recently added xt_connlabel support, from
  Florian Westphal.

* Allow to update conntracks via nfqueue, don't need NFQA_CFG_F_CONNTRACK for
  this, from Florian Westphal.

* Remove tproxy core, now that we have socket early demux, from Florian
  Westphal.

* A couple of patches to refactor conntrack event reporting to save a good
  bunch of lines, from Florian Westphal.

* Fix missing locking in NAT sequence adjustment, it did not manifested in
  any known bug so far, from Patrick McHardy.

* Change sequence number adjustment variable to 32 bits, to delay the
  possible early overflow in long standing connections, also from Patrick.

* Comestic cleanups for IPVS, from Dragos Foianu.

* Fix possible null dereference in IPVS in the SH scheduler, from Daniel
  Borkmann.

* Allow to attach conntrack expectations via nfqueue. Before this patch, you
  had to use ctnetlink instead, thus, we save the conntrack lookup.

* Export xt_rpfilter and xt_HMARK header files, from Nicolas Dichtel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 13:30:54 -07:00
David S. Miller
2ff1cf12c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
Francesco Fusco
d14c5ab6be net: proc_fs: trivial: print UIDs as unsigned int
UIDs are printed in the proc_fs as signed int, whereas
they are unsigned int.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 14:37:46 -07:00
Nicolas Dichtel
0bd8762824 ip6tnl: add x-netns support
This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 01:00:20 -07:00
Nicolas Dichtel
fc8f999daa ipv4 tunnels: use net_eq() helper to check netns
It's better to use available helpers for these tests.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 01:00:20 -07:00
Hannes Frederic Sowa
fc4eba58b4 ipv6: make unsolicited report intervals configurable for mld
Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp:
Reduce Unsolicited report interval to 1s when using IGMPv3") and
2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space
configuration of igmp unsolicited report interval") by William Manley made
igmp unsolicited report intervals configurable per interface and corrected
the interval of unsolicited igmpv3 report messages resendings to 1s.

Same needs to be done for IPv6:

MLDv1 (RFC2710 7.10.): 10 seconds
MLDv2 (RFC3810 9.11.): 1 second

Both intervals are configurable via new procfs knobs
mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval.

(also added .force_mld_version to ipv6_devconf_dflt to bring structs in
line without semantic changes)

v2:
a) Joined documentation update for IPv4 and IPv6 MLD/IGMP
   unsolicited_report_interval procfs knobs.
b) incorporate stylistic feedback from William Manley

v3:
a) add new DEVCONF_* values to the end of the enum (thanks to David
   Miller)

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: William Manley <william.manley@youview.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13 17:05:04 -07:00
Florian Westphal
c655bc6896 netfilter: nf_conntrack: don't send destroy events from iterator
Let nf_ct_delete handle delivery of the DESTROY event.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-09 12:03:33 +02:00
Eric Dumazet
1f07d03e20 net: add SNMP counters tracking incoming ECN bits
With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
counters do not count the number of frames, but number of aggregated
segments.

Its probably too late to change this now.

This patch adds four new counters, tracking number of frames, regardless
of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.

Ip[6]NoECTPkts : Number of packets received with NOECT
Ip[6]ECT1Pkts  : Number of packets received with ECT(1)
Ip[6]ECT0Pkts  : Number of packets received with ECT(0)
Ip[6]CEPkts    : Number of packets received with Congestion Experienced

lph37:~# nstat | egrep "Pkts|InReceive"
IpInReceives                    1634137            0.0
Ip6InReceives                   3714107            0.0
Ip6InNoECTPkts                  19205              0.0
Ip6InECT0Pkts                   52651828           0.0
IpExtInNoECTPkts                33630              0.0
IpExtInECT0Pkts                 15581379           0.0
IpExtInCEPkts                   6                  0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-08 22:24:59 -07:00
Hannes Frederic Sowa
3e3be27585 ipv6: don't stop backtracking in fib6_lookup_1 if subtree does not match
In case a subtree did not match we currently stop backtracking and return
NULL (root table from fib_lookup). This could yield in invalid routing
table lookups when using subtrees.

Instead continue to backtrack until a valid subtree or node is found
and return this match.

Also remove unneeded NULL check.

Reported-by: Teco Boot <teco@inf-net.nl>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David Lamparter <equinox@diac24.net>
Cc: <boutier@pps.univ-paris-diderot.fr>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 17:17:19 -07:00
Daniel Borkmann
7921895a5e net: esp{4,6}: fix potential MTU calculation overflows
Commit 91657eafb ("xfrm: take net hdr len into account for esp payload
size calculation") introduced a possible interger overflow in
esp{4,6}_get_mtu() handlers in case of x->props.mode equals
XFRM_MODE_TUNNEL. Thus, the following expression will overflow

  unsigned int net_adj;
  ...
  <case ipv{4,6} XFRM_MODE_TUNNEL>
         net_adj = 0;
  ...
  return ((mtu - x->props.header_len - crypto_aead_authsize(esp->aead) -
           net_adj) & ~(align - 1)) + (net_adj - 2);

where (net_adj - 2) would be evaluated as <foo> + (0 - 2) in an unsigned
context. Fix it by simply removing brackets as those operations here
do not need to have special precedence.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Benjamin Poirier <bpoirier@suse.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05 12:26:50 -07:00
David S. Miller
0e76a3a587 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 21:36:46 -07:00
Stefan Tomanek
73f5698e77 fib_rules: fix suppressor names and default values
This change brings the suppressor attribute names into line; it also changes
the data types to provide a more consistent interface.

While -1 indicates that the suppressor is not enabled, values >= 0 for
suppress_prefixlen or suppress_ifgroup  reject routing decisions violating the
constraint.

This changes the previously presented behaviour of suppress_prefixlen, where a
prefix length _less_ than the attribute value was rejected. After this change,
a prefix length less than *or* equal to the value is considered a violation of
the rule constraint.

It also changes the default values for default and newly added rules (disabling
any suppression for those).

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 10:40:23 -07:00
Stefan Tomanek
6ef94cfafb fib_rules: add route suppression based on ifgroup
This change adds the ability to suppress a routing decision based upon the
interface group the selected interface belongs to. This allows it to
exclude specific devices from a routing decision.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 15:24:22 -07:00
Werner Almesberger
d1c53c8e87 icmpv6_filter: allow ICMPv6 messages with bodies < 4 bytes
By using sizeof(_hdr), net/ipv6/raw.c:icmpv6_filter implicitly assumes
that any valid ICMPv6 message is at least eight bytes long, i.e., that
the message body is at least four bytes.

The DIS message of RPL (RFC 6550 section 6.2, from the 6LoWPAN world),
has a minimum length of only six bytes, and is thus blocked by
icmpv6_filter.

RFC 4443 seems to allow even a zero-sized body, making the minimum
allowable message size four bytes.

Signed-off-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 15:15:50 -07:00
Werner Almesberger
9cc08af3a1 icmpv6_filter: fix "_hdr" incorrectly being a pointer
"_hdr" should hold the ICMPv6 header while "hdr" is the pointer to it.
This worked by accident.

Signed-off-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 15:15:50 -07:00
fan.du
439677d766 ipv6: bump genid when delete/add address
Server           Client
2001:1::803/64  <-> 2001:1::805/64
2001:2::804/64  <-> 2001:2::806/64

Server side fib binary tree looks like this:

                                   (2001:/64)
                                   /
                                  /
                   ffff88002103c380
                 /                 \
     (2)        /                   \
 (2001::803/128)                     ffff880037ac07c0
                                    /               \
                                   /                 \  (3)
                      ffff880037ac0640               (2001::806/128)
                       /             \
             (1)      /               \
        (2001::804/128)               (2001::805/128)

Delete 2001::804/64 won't cause prefix route deleted as well as rt in (3)
destinate to 2001::806 with source address as 2001::804/64. That's because
2001::803/64 is still alive, which make onlink=1 in ipv6_del_addr, this is
where the substantial difference between same prefix configuration and
different prefix configuration :) So packet are still transmitted out to
2001::806 with source address as 2001::804/64.

So bump genid will clear rt in (3), and up layer protocol will eventually
find the right one for themselves.

This problem arised from the discussion in here:
http://marc.info/?l=linux-netdev&m=137404469219410&w=4

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 16:33:32 -07:00
Jiri Benc
8a226b2cfa ipv6: prevent race between address creation and removal
There's a race in IPv6 automatic addess assignment. The address is created
with zero lifetime when it's added to various address lists. Before it gets
assigned the correct lifetime, there's a window where a new address may be
configured. This causes the semi-initiated address to be deleted in
addrconf_verify.

This was discovered as a reference leak caused by concurrent run of
__ipv6_ifa_notify for both RTM_NEWADDR and RTM_DELADDR with the same
address.

Fix this by setting the lifetime before the address is added to
inet6_addr_lst.

A few notes:

1. In addrconf_prefix_rcv, by setting update_lft to zero, the
   if (update_lft) { ... } condition is no longer executed for newly
   created addresses. This is okay, as the ifp fields are set in
   ipv6_add_addr now and ipv6_ifa_notify is called (and has been called)
   through addrconf_dad_start.

2. The removal of the whole block under ifp->lock in inet6_addr_add is okay,
   too, as tstamp is initialized to jiffies in ipv6_add_addr.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 14:16:20 -07:00
Jiri Pirko
3f8f52982a ipv6: move peer_addr init into ipv6_add_addr()
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 14:16:20 -07:00
Michal Kubeček
49a18d86f6 ipv6: update ip6_rt_last_gc every time GC is run
As pointed out by Eric Dumazet, net->ipv6.ip6_rt_last_gc should
hold the last time garbage collector was run so that we should
update it whenever fib6_run_gc() calls fib6_clean_all(), not only
if we got there from ip6_dst_gc().

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 14:16:20 -07:00
Michal Kubeček
2ac3ac8f86 ipv6: prevent fib6_run_gc() contention
On a high-traffic router with many processors and many IPv6 dst
entries, soft lockup in fib6_run_gc() can occur when number of
entries reaches gc_thresh.

This happens because fib6_run_gc() uses fib6_gc_lock to allow
only one thread to run the garbage collector but ip6_dst_gc()
doesn't update net->ipv6.ip6_rt_last_gc until fib6_run_gc()
returns. On a system with many entries, this can take some time
so that in the meantime, other threads pass the tests in
ip6_dst_gc() (ip6_rt_last_gc is still not updated) and wait for
the lock. They then have to run the garbage collector one after
another which blocks them for quite long.

Resolve this by replacing special value ~0UL of expire parameter
to fib6_run_gc() by explicit "force" parameter to choose between
spin_lock_bh() and spin_trylock_bh() and call fib6_run_gc() with
force=false if gc_thresh is reached but not max_size.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 14:16:20 -07:00
Hannes Frederic Sowa
46b3a42190 ipv6: fib6_rules should return exact return value
With the addition of the suppress operation
(7764a45a8f1fe74d4f7d301eaca2e558e7e2831a ("fib_rules: add .suppress
operation") we rely on accurate error reporting of the fib_rules.actions.

fib6_rule_action always returned -EAGAIN in case we could not find a
matching route and 0 if a rule was matched. This also included a match
for blackhole or prohibited rule actions which could get suppressed by
the new logic.

So adapt fib6_rule_action to always return the correct error code as
its counterpart fib4_rule_action does. This also fixes a possiblity of
nullptr-deref where we don't find a table, thus rt == NULL. Because
the condition rt != ip6_null_entry still holdes it seems we could later
get a nullptr bug on dereference rt->dst.

v2:
a) Fixed a brain fart in the commit msg (the rule => a table, etc). No
   changes to the patch.

Cc: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 00:26:22 -07:00
Stefan Tomanek
7764a45a8f fib_rules: add .suppress operation
This change adds a new operation to the fib_rules_ops struct; it allows the
suppression of routing decisions if certain criteria are not met by its
results.

The first implemented constraint is a minimum prefix length added to the
structures of routing rules. If a rule is added with a minimum prefix length
>0, only routes meeting this threshold will be considered. Any other (more
general) routing table entries will be ignored.

When configuring a system with multiple network uplinks and default routes, it
is often convinient to reference the main routing table multiple times - but
omitting the default route. Using this patch and a modified "ip" utility, this
can be achieved by using the following command sequence:

  $ ip route add table secuplink default via 10.42.23.1

  $ ip rule add pref 100            table main prefixlength 1
  $ ip rule add pref 150 fwmark 0xA table secuplink

With this setup, packets marked 0xA will be processed by the additional routing
table "secuplink", but only if no suitable route in the main routing table can
be found. By using a minimal prefixlength of 1, the default route (/0) of the
table "main" is hidden to packets processed by rule 100; packets traveling to
destinations with more specific routing entries are processed as usual.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:27:17 -07:00
fan.du
ca4c3fc24e net: split rt_genid for ipv4 and ipv6
Current net name space has only one genid for both IPv4 and IPv6, it has below
drawbacks:

- Add/delete an IPv4 address will invalidate all IPv6 routing table entries.
- Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table
  entries even when the policy is only applied for one address family.

Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6
separately in a fine granularity.

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 14:56:36 -07:00
Hannes Frederic Sowa
5ad37d5dee tcp: add tcp_syncookies mode to allow unconditionally generation of syncookies
| If you want to test which effects syncookies have to your
| network connections you can set this knob to 2 to enable
| unconditionally generation of syncookies.

Original idea and first implementation by Eric Dumazet.

Cc: Florian Westphal <fw@strlen.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 16:15:18 -07:00
Hannes Frederic Sowa
9d4a031464 ipv4, ipv6: send igmpv3/mld packets with TC_PRIO_CONTROL
v2:
a) Also send ipv4 igmp messages with TC_PRIO_CONTROL

Cc: William Manley <william.manley@youview.com>
Cc: Lukas Tribus <luky-37@hotmail.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-28 11:13:55 -07:00
Eric Dumazet
c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Hannes Frederic Sowa
905a6f96a1 ipv6: take rtnl_lock and mark mrt6 table as freed on namespace cleanup
Otherwise we end up dereferencing the already freed net->ipv6.mrt pointer
which leads to a panic (from Srivatsa S. Bhat):

BUG: unable to handle kernel paging request at ffff882018552020
IP: [<ffffffffa0366b02>] ip6mr_sk_done+0x32/0xb0 [ipv6]
PGD 290a067 PUD 207ffe0067 PMD 207ff1d067 PTE 8000002018552060
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: ebtable_nat ebtables nfs fscache nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables nfsd lockd nfs_acl exportfs auth_rpcgss autofs4 sunrpc 8021q garp bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
+ip6_tables ipv6 vfat fat vhost_net macvtap macvlan vhost tun kvm_intel kvm uinput iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii microcode i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma dca mlx4_core be2net wmi acpi_cpufreq mperf ext4 jbd2 mbcache dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 7 Comm: kworker/u33:0 Not tainted 3.11.0-rc1-ea45e-a #4
Hardware name: IBM  -[8737R2A]-/00Y2738, BIOS -[B2E120RUS-1.20]- 11/30/2012
Workqueue: netns cleanup_net
task: ffff8810393641c0 ti: ffff881039366000 task.ti: ffff881039366000
RIP: 0010:[<ffffffffa0366b02>]  [<ffffffffa0366b02>] ip6mr_sk_done+0x32/0xb0 [ipv6]
RSP: 0018:ffff881039367bd8  EFLAGS: 00010286
RAX: ffff881039367fd8 RBX: ffff882018552000 RCX: dead000000200200
RDX: 0000000000000000 RSI: ffff881039367b68 RDI: ffff881039367b68
RBP: ffff881039367bf8 R08: ffff881039367b68 R09: 2222222222222222
R10: 2222222222222222 R11: 2222222222222222 R12: ffff882015a7a040
R13: ffff882014eb89c0 R14: ffff8820289e2800 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88103fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff882018552020 CR3: 0000000001c0b000 CR4: 00000000000407f0
Stack:
 ffff881039367c18 ffff882014eb89c0 ffff882015e28c00 0000000000000000
 ffff881039367c18 ffffffffa034d9d1 ffff8820289e2800 ffff882014eb89c0
 ffff881039367c58 ffffffff815bdecb ffffffff815bddf2 ffff882014eb89c0
Call Trace:
 [<ffffffffa034d9d1>] rawv6_close+0x21/0x40 [ipv6]
 [<ffffffff815bdecb>] inet_release+0xfb/0x220
 [<ffffffff815bddf2>] ? inet_release+0x22/0x220
 [<ffffffffa032686f>] inet6_release+0x3f/0x50 [ipv6]
 [<ffffffff8151c1d9>] sock_release+0x29/0xa0
 [<ffffffff81525520>] sk_release_kernel+0x30/0x70
 [<ffffffffa034f14b>] icmpv6_sk_exit+0x3b/0x80 [ipv6]
 [<ffffffff8152fff9>] ops_exit_list+0x39/0x60
 [<ffffffff815306fb>] cleanup_net+0xfb/0x1a0
 [<ffffffff81075e3a>] process_one_work+0x1da/0x610
 [<ffffffff81075dc9>] ? process_one_work+0x169/0x610
 [<ffffffff81076390>] worker_thread+0x120/0x3a0
 [<ffffffff81076270>] ? process_one_work+0x610/0x610
 [<ffffffff8107da2e>] kthread+0xee/0x100
 [<ffffffff8107d940>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff8162a99c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8107d940>] ? __init_kthread_worker+0x70/0x70
Code: 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 4c 8b 67 30 49 89 fd e8 db 3c 1e e1 49 8b 9c 24 90 08 00 00 48 85 db 74 06 <4c> 39 6b 20 74 20 bb f3 ff ff ff e8 8e 3c 1e e1 89 d8 4c 8b 65
RIP  [<ffffffffa0366b02>] ip6mr_sk_done+0x32/0xb0 [ipv6]
 RSP <ffff881039367bd8>
CR2: ffff882018552020

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:02:13 -07:00
fan.du
9225b23057 net: ipv6 eliminate parameter "int addrlen" in function fib6_add_1
The "int addrlen" in fib6_add_1 is rebundant, as we can get it from
parameter "struct in6_addr *addr" once we modified its type.
And also fix some coding style issues in fib6_add_1

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 14:24:56 -07:00
fan.du
86a37def2b net ipv6: Remove rebundant rt6i_nsiblings initialization
Seting rt->rt6i_nsiblings to zero is rebundant, because above memset
zeroed the rest of rt excluding the first dst memember.

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 14:24:56 -07:00
Rami Rosen
2b52c3ada8 ip6mr: change the prototype of ip6_mr_forward().
This patch changes the prototpye of the ip6_mr_forward() method to return void
instead of int.

The ip6_mr_forward() method always returns 0; moreover, the return value of this
method is not checked anywhere.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 17:20:37 -07:00
Yuchung Cheng
375fe02c91 tcp: consolidate SYNACK RTT sampling
The first patch consolidates SYNACK and other RTT measurement to use a
central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK
RTT measurement happens after PAWS check, potentially reducing the
impact of RTO seeding on bad TCP timestamps values.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Daniel Baluta
f2f79cca13 ndisc: bool initializations should use true and false
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-16 12:13:31 -07:00
Hannes Frederic Sowa
307f2fb95e ipv6: only static routes qualify for equal cost multipathing
Static routes in this case are non-expiring routes which did not get
configured by autoconf or by icmpv6 redirects.

To make sure we actually get an ecmp route while searching for the first
one in this fib6_node's leafs, also make sure it matches the ecmp route
assumptions.

v2:
a) Removed RTF_EXPIRE check in dst.from chain. The check of RTF_ADDRCONF
   already ensures that this route, even if added again without
   RTF_EXPIRES (in case of a RA announcement with infinite timeout),
   does not cause the rt6i_nsiblings logic to go wrong if a later RA
   updates the expiration time later.

v3:
a) Allow RTF_EXPIRES routes to enter the ecmp route set. We have to do so,
   because an pmtu event could update the RTF_EXPIRES flag and we would
   not count this route, if another route joins this set. We now filter
   only for RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC, which are flags that
   don't get changed after rt6_info construction.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-12 16:29:54 -07:00
Hannes Frederic Sowa
afc154e978 ipv6: fix route selection if kernel is not compiled with CONFIG_IPV6_ROUTER_PREF
This is a follow-up patch to 3630d40067a21d4dfbadc6002bb469ce26ac5d52
("ipv6: rt6_check_neigh should successfully verify neigh if no NUD
information are available").

Since the removal of rt->n in rt6_info we can end up with a dst ==
NULL in rt6_check_neigh. In case the kernel is not compiled with
CONFIG_IPV6_ROUTER_PREF we should also select a route with unkown
NUD state but we must not avoid doing round robin selection on routes
with the same target. So introduce and pass down a boolean ``do_rr'' to
indicate when we should update rt->rr_ptr. As soon as no route is valid
we do backtracking and do a lookup on a higher level in the fib trie.

v2:
a) Improved rt6_check_neigh logic (no need to create neighbour there)
   and documented return values.

v3:
a) Introduce enum rt6_nud_state to get rid of the magic numbers
   (thanks to David Miller).
b) Update and shorten commit message a bit to actualy reflect
   the source.

Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11 11:51:10 -07:00
Hannes Frederic Sowa
1eb4f75828 ipv6: in case of link failure remove route directly instead of letting it expire
We could end up expiring a route which is part of an ecmp route set. Doing
so would invalidate the rt->rt6i_nsiblings calculations and could provoke
the following panic:

[   80.144667] ------------[ cut here ]------------
[   80.145172] kernel BUG at net/ipv6/ip6_fib.c:733!
[   80.145172] invalid opcode: 0000 [#1] SMP
[   80.145172] Modules linked in: 8021q nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables
+snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer virtio_balloon snd soundcore i2c_piix4 i2c_core virtio_net virtio_blk
[   80.145172] CPU: 1 PID: 786 Comm: ping6 Not tainted 3.10.0+ #118
[   80.145172] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   80.145172] task: ffff880117fa0000 ti: ffff880118770000 task.ti: ffff880118770000
[   80.145172] RIP: 0010:[<ffffffff815f3b5d>]  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
[   80.145172] RSP: 0018:ffff880118771798  EFLAGS: 00010202
[   80.145172] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011350e480
[   80.145172] RDX: ffff88011350e238 RSI: 0000000000000004 RDI: ffff88011350f738
[   80.145172] RBP: ffff880118771848 R08: ffff880117903280 R09: 0000000000000001
[   80.145172] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011350f680
[   80.145172] R13: ffff880117903280 R14: ffff880118771890 R15: ffff88011350ef90
[   80.145172] FS:  00007f02b5127740(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
[   80.145172] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   80.145172] CR2: 00007f981322a000 CR3: 00000001181b1000 CR4: 00000000000006e0
[   80.145172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   80.145172] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   80.145172] Stack:
[   80.145172]  0000000000000001 ffff880100000000 ffff880100000000 ffff880117903280
[   80.145172]  0000000000000000 ffff880119a4cf00 0000000000000400 00000000000007fa
[   80.145172]  0000000000000000 0000000000000000 0000000000000000 ffff88011350f680
[   80.145172] Call Trace:
[   80.145172]  [<ffffffff815eeceb>] ? rt6_bind_peer+0x4b/0x90
[   80.145172]  [<ffffffff815ed985>] __ip6_ins_rt+0x45/0x70
[   80.145172]  [<ffffffff815eee35>] ip6_ins_rt+0x35/0x40
[   80.145172]  [<ffffffff815ef1e4>] ip6_pol_route.isra.44+0x3a4/0x4b0
[   80.145172]  [<ffffffff815ef34a>] ip6_pol_route_output+0x2a/0x30
[   80.145172]  [<ffffffff81616077>] fib6_rule_action+0xd7/0x210
[   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
[   80.145172]  [<ffffffff81553026>] fib_rules_lookup+0xc6/0x140
[   80.145172]  [<ffffffff81616374>] fib6_rule_lookup+0x44/0x80
[   80.145172]  [<ffffffff815ef320>] ? ip6_pol_route_input+0x30/0x30
[   80.145172]  [<ffffffff815edea3>] ip6_route_output+0x73/0xb0
[   80.145172]  [<ffffffff815dfdf3>] ip6_dst_lookup_tail+0x2c3/0x2e0
[   80.145172]  [<ffffffff813007b1>] ? list_del+0x11/0x40
[   80.145172]  [<ffffffff81082a4c>] ? remove_wait_queue+0x3c/0x50
[   80.145172]  [<ffffffff815dfe4d>] ip6_dst_lookup_flow+0x3d/0xa0
[   80.145172]  [<ffffffff815fda77>] rawv6_sendmsg+0x267/0xc20
[   80.145172]  [<ffffffff815a8a83>] inet_sendmsg+0x63/0xb0
[   80.145172]  [<ffffffff8128eb93>] ? selinux_socket_sendmsg+0x23/0x30
[   80.145172]  [<ffffffff815218d6>] sock_sendmsg+0xa6/0xd0
[   80.145172]  [<ffffffff81524a68>] SYSC_sendto+0x128/0x180
[   80.145172]  [<ffffffff8109825c>] ? update_curr+0xec/0x170
[   80.145172]  [<ffffffff81041d09>] ? kvm_clock_get_cycles+0x9/0x10
[   80.145172]  [<ffffffff810afd1e>] ? __getnstimeofday+0x3e/0xd0
[   80.145172]  [<ffffffff8152509e>] SyS_sendto+0xe/0x10
[   80.145172]  [<ffffffff8164efd9>] system_call_fastpath+0x16/0x1b
[   80.145172] Code: fe ff ff 41 f6 45 2a 06 0f 85 ca fe ff ff 49 8b 7e 08 4c 89 ee e8 94 ef ff ff e9 b9 fe ff ff 48 8b 82 28 05 00 00 e9 01 ff ff ff <0f> 0b 49 8b 54 24 30 0d 00 00 40 00 89 83 14 01 00 00 48 89 53
[   80.145172] RIP  [<ffffffff815f3b5d>] fib6_add+0x75d/0x830
[   80.145172]  RSP <ffff880118771798>
[   80.387413] ---[ end trace 02f20b7a8b81ed95 ]---
[   80.390154] Kernel panic - not syncing: Fatal exception in interrupt

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 19:45:39 -07:00
Eliezer Tamir
8b80cda536 net: rename ll methods to busy-poll
Rename ndo_ll_poll to ndo_busy_poll.
Rename sk_mark_ll to sk_mark_napi_id.
Rename skb_mark_ll to skb_mark_napi_id.
Correct all useres of these functions.
Update comments and defines  in include/net/busy_poll.h

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir
076bb0c82a net: rename include/net/ll_poll.h to include/net/busy_poll.h
Rename the file and correct all the places where it is included.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Nicolas Dichtel
86bd68bfd7 sit: fix tunnel update via netlink
The device can stand in another netns, hence we need to do the lookup in netns
tunnel->net.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-04 14:55:47 -07:00
Hannes Frederic Sowa
3630d40067 ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
After the removal of rt->n we do not create a neighbour entry at route
insertion time (rt6_bind_neighbour is gone). As long as no neighbour is
created because of "useful traffic" we skip this routing entry because
rt6_check_neigh cannot pick up a valid neighbour (neigh == NULL) and
thus returns false.

This change was introduced by commit
887c95cc1da53f66a5890fdeab13414613010097 ("ipv6: Complete neighbour
entry removal from dst_entry.")

To quote RFC4191:
"If the host has no information about the router's reachability, then
the host assumes the router is reachable."

and also:
"A host MUST NOT probe a router's reachability in the absence of useful
traffic that the host would have sent to the router if it were reachable."

So, just assume the router is reachable and let's rt6_probe do the
rest. We don't need to create a neighbour on route insertion time.

If we don't compile with CONFIG_IPV6_ROUTER_PREF (RFC4191 support)
a neighbour is only valid if its nud_state is NUD_VALID. I did not find
any references that we should probe the router on route insertion time
via the other RFCs. So skip this route in that case.

v2:
a) use IS_ENABLED instead of #ifdefs (thanks to Sergei Shtylyov)

Reported-by: Pierre Emeriaud <petrus.lt@gmail.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 17:50:51 -07:00
Lorenzo Colitti
fbfe80c890 net: ipv6: fix wrong ping_v6_sendmsg return value
ping_v6_sendmsg currently returns 0 on success. It should return
the number of bytes written instead.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 17:42:05 -07:00
Lorenzo Colitti
a1bdc45580 net: ipv6: add missing lock in ping_v6_sendmsg
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 17:40:58 -07:00
David S. Miller
0c1072ae02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fec_main.c
	drivers/net/ethernet/renesas/sh_eth.c
	net/ipv4/gre.c

The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.

The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.

Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:55:13 -07:00
Hannes Frederic Sowa
75a493e60a ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size
If the socket had an IPV6_MTU value set, ip6_append_data_mtu lost track
of this when appending the second frame on a corked socket. This results
in the following splat:

[37598.993962] ------------[ cut here ]------------
[37598.994008] kernel BUG at net/core/skbuff.c:2064!
[37598.994008] invalid opcode: 0000 [#1] SMP
[37598.994008] Modules linked in: tcp_lp uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media vfat fat usb_storage fuse ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat
+nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi
+scsi_transport_iscsi rfcomm bnep iTCO_wdt iTCO_vendor_support snd_hda_codec_conexant arc4 iwldvm mac80211 snd_hda_intel acpi_cpufreq mperf coretemp snd_hda_codec microcode cdc_wdm cdc_acm
[37598.994008]  snd_hwdep cdc_ether snd_seq snd_seq_device usbnet mii joydev btusb snd_pcm bluetooth i2c_i801 e1000e lpc_ich mfd_core ptp iwlwifi pps_core snd_page_alloc mei cfg80211 snd_timer thinkpad_acpi snd tpm_tis soundcore rfkill tpm tpm_bios vhost_net tun macvtap macvlan kvm_intel kvm uinput binfmt_misc
+dm_crypt i915 i2c_algo_bit drm_kms_helper drm i2c_core wmi video
[37598.994008] CPU 0
[37598.994008] Pid: 27320, comm: t2 Not tainted 3.9.6-200.fc18.x86_64 #1 LENOVO 27744PG/27744PG
[37598.994008] RIP: 0010:[<ffffffff815443a5>]  [<ffffffff815443a5>] skb_copy_and_csum_bits+0x325/0x330
[37598.994008] RSP: 0018:ffff88003670da18  EFLAGS: 00010202
[37598.994008] RAX: ffff88018105c018 RBX: 0000000000000004 RCX: 00000000000006c0
[37598.994008] RDX: ffff88018105a6c0 RSI: ffff88018105a000 RDI: ffff8801e1b0aa00
[37598.994008] RBP: ffff88003670da78 R08: 0000000000000000 R09: ffff88018105c040
[37598.994008] R10: ffff8801e1b0aa00 R11: 0000000000000000 R12: 000000000000fff8
[37598.994008] R13: 00000000000004fc R14: 00000000ffff0504 R15: 0000000000000000
[37598.994008] FS:  00007f28eea59740(0000) GS:ffff88023bc00000(0000) knlGS:0000000000000000
[37598.994008] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[37598.994008] CR2: 0000003d935789e0 CR3: 00000000365cb000 CR4: 00000000000407f0
[37598.994008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37598.994008] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[37598.994008] Process t2 (pid: 27320, threadinfo ffff88003670c000, task ffff88022c162ee0)
[37598.994008] Stack:
[37598.994008]  ffff88022e098a00 ffff88020f973fc0 0000000000000008 00000000000004c8
[37598.994008]  ffff88020f973fc0 00000000000004c4 ffff88003670da78 ffff8801e1b0a200
[37598.994008]  0000000000000018 00000000000004c8 ffff88020f973fc0 00000000000004c4
[37598.994008] Call Trace:
[37598.994008]  [<ffffffff815fc21f>] ip6_append_data+0xccf/0xfe0
[37598.994008]  [<ffffffff8158d9f0>] ? ip_copy_metadata+0x1a0/0x1a0
[37598.994008]  [<ffffffff81661f66>] ? _raw_spin_lock_bh+0x16/0x40
[37598.994008]  [<ffffffff8161548d>] udpv6_sendmsg+0x1ed/0xc10
[37598.994008]  [<ffffffff812a2845>] ? sock_has_perm+0x75/0x90
[37598.994008]  [<ffffffff815c3693>] inet_sendmsg+0x63/0xb0
[37598.994008]  [<ffffffff812a2973>] ? selinux_socket_sendmsg+0x23/0x30
[37598.994008]  [<ffffffff8153a450>] sock_sendmsg+0xb0/0xe0
[37598.994008]  [<ffffffff810135d1>] ? __switch_to+0x181/0x4a0
[37598.994008]  [<ffffffff8153d97d>] sys_sendto+0x12d/0x180
[37598.994008]  [<ffffffff810dfb64>] ? __audit_syscall_entry+0x94/0xf0
[37598.994008]  [<ffffffff81020ed1>] ? syscall_trace_enter+0x231/0x240
[37598.994008]  [<ffffffff8166a7e7>] tracesys+0xdd/0xe2
[37598.994008] Code: fe 07 00 00 48 c7 c7 04 28 a6 81 89 45 a0 4c 89 4d b8 44 89 5d a8 e8 1b ac b1 ff 44 8b 5d a8 4c 8b 4d b8 8b 45 a0 e9 cf fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 48
[37598.994008] RIP  [<ffffffff815443a5>] skb_copy_and_csum_bits+0x325/0x330
[37598.994008]  RSP <ffff88003670da18>
[37599.007323] ---[ end trace d69f6a17f8ac8eee ]---

While there, also check if path mtu discovery is activated for this
socket. The logic was adapted from ip6_append_data when first writing
on the corked socket.

This bug was introduced with commit
0c1833797a5a6ec23ea9261d979aa18078720b74 ("ipv6: fix incorrect ipsec
fragment").

v2:
a) Replace IPV6_PMTU_DISC_DO with IPV6_PMTUDISC_PROBE.
b) Don't pass ipv6_pinfo to ip6_append_data_mtu (suggestion by Gao
   feng, thanks!).
c) Change mtu to unsigned int, else we get a warning about
   non-matching types because of the min()-macro type-check.

Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 12:44:18 -07:00
Hannes Frederic Sowa
8822b64a0f ipv6: call udp_push_pending_frames when uncorking a socket with AF_INET pending data
We accidentally call down to ip6_push_pending_frames when uncorking
pending AF_INET data on a ipv6 socket. This results in the following
splat (from Dave Jones):

skbuff: skb_under_panic: text:ffffffff816765f6 len:48 put:40 head:ffff88013deb6df0 data:ffff88013deb6dec tail:0x2c end:0xc0 dev:<NULL>
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:126!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: dccp_ipv4 dccp 8021q garp bridge stp dlci mpoa snd_seq_dummy sctp fuse hidp tun bnep nfnetlink scsi_transport_iscsi rfcomm can_raw can_bcm af_802154 appletalk caif_socket can caif ipt_ULOG x25 rose af_key pppoe pppox ipx phonet irda llc2 ppp_generic slhc p8023 psnap p8022 llc crc_ccitt atm bluetooth
+netrom ax25 nfc rfkill rds af_rxrpc coretemp hwmon kvm_intel kvm crc32c_intel snd_hda_codec_realtek ghash_clmulni_intel microcode pcspkr snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep usb_debug snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
CPU: 2 PID: 8095 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #37
task: ffff8801f52c2520 ti: ffff8801e6430000 task.ti: ffff8801e6430000
RIP: 0010:[<ffffffff816e759c>]  [<ffffffff816e759c>] skb_panic+0x63/0x65
RSP: 0018:ffff8801e6431de8  EFLAGS: 00010282
RAX: 0000000000000086 RBX: ffff8802353d3cc0 RCX: 0000000000000006
RDX: 0000000000003b90 RSI: ffff8801f52c2ca0 RDI: ffff8801f52c2520
RBP: ffff8801e6431e08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88022ea0c800
R13: ffff88022ea0cdf8 R14: ffff8802353ecb40 R15: ffffffff81cc7800
FS:  00007f5720a10740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000005862000 CR3: 000000022843c000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff88013deb6dec 000000000000002c 00000000000000c0 ffffffff81a3f6e4
 ffff8801e6431e18 ffffffff8159a9aa ffff8801e6431e90 ffffffff816765f6
 ffffffff810b756b 0000000700000002 ffff8801e6431e40 0000fea9292aa8c0
Call Trace:
 [<ffffffff8159a9aa>] skb_push+0x3a/0x40
 [<ffffffff816765f6>] ip6_push_pending_frames+0x1f6/0x4d0
 [<ffffffff810b756b>] ? mark_held_locks+0xbb/0x140
 [<ffffffff81694919>] udp_v6_push_pending_frames+0x2b9/0x3d0
 [<ffffffff81694660>] ? udplite_getfrag+0x20/0x20
 [<ffffffff8162092a>] udp_lib_setsockopt+0x1aa/0x1f0
 [<ffffffff811cc5e7>] ? fget_light+0x387/0x4f0
 [<ffffffff816958a4>] udpv6_setsockopt+0x34/0x40
 [<ffffffff815949f4>] sock_common_setsockopt+0x14/0x20
 [<ffffffff81593c31>] SyS_setsockopt+0x71/0xd0
 [<ffffffff816f5d54>] tracesys+0xdd/0xe2
Code: 00 00 48 89 44 24 10 8b 87 d8 00 00 00 48 89 44 24 08 48 8b 87 e8 00 00 00 48 c7 c7 c0 04 aa 81 48 89 04 24 31 c0 e8 e1 7e ff ff <0f> 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55
RIP  [<ffffffff816e759c>] skb_panic+0x63/0x65
 RSP <ffff8801e6431de8>

This patch adds a check if the pending data is of address family AF_INET
and directly calls udp_push_ending_frames from udp_v6_push_pending_frames
if that is the case.

This bug was found by Dave Jones with trinity.

(Also move the initialization of fl6 below the AF_INET check, even if
not strictly necessary.)

Cc: Dave Jones <davej@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 12:44:18 -07:00
Amerigo Wang
8965779d2c ipv6,mcast: always hold idev->lock before mca_lock
dingtianhong reported the following deadlock detected by lockdep:

 ======================================================
 [ INFO: possible circular locking dependency detected ]
 3.4.24.05-0.1-default #1 Not tainted
 -------------------------------------------------------
 ksoftirqd/0/3 is trying to acquire lock:
  (&ndev->lock){+.+...}, at: [<ffffffff8147f804>] ipv6_get_lladdr+0x74/0x120

 but task is already holding lock:
  (&mc->mca_lock){+.+...}, at: [<ffffffff8149d130>] mld_send_report+0x40/0x150

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&mc->mca_lock){+.+...}:
        [<ffffffff810a8027>] validate_chain+0x637/0x730
        [<ffffffff810a8417>] __lock_acquire+0x2f7/0x500
        [<ffffffff810a8734>] lock_acquire+0x114/0x150
        [<ffffffff814f691a>] rt_spin_lock+0x4a/0x60
        [<ffffffff8149e4bb>] igmp6_group_added+0x3b/0x120
        [<ffffffff8149e5d8>] ipv6_mc_up+0x38/0x60
        [<ffffffff81480a4d>] ipv6_find_idev+0x3d/0x80
        [<ffffffff81483175>] addrconf_notify+0x3d5/0x4b0
        [<ffffffff814fae3f>] notifier_call_chain+0x3f/0x80
        [<ffffffff81073471>] raw_notifier_call_chain+0x11/0x20
        [<ffffffff813d8722>] call_netdevice_notifiers+0x32/0x60
        [<ffffffff813d92d4>] __dev_notify_flags+0x34/0x80
        [<ffffffff813d9360>] dev_change_flags+0x40/0x70
        [<ffffffff813ea627>] do_setlink+0x237/0x8a0
        [<ffffffff813ebb6c>] rtnl_newlink+0x3ec/0x600
        [<ffffffff813eb4d0>] rtnetlink_rcv_msg+0x160/0x310
        [<ffffffff814040b9>] netlink_rcv_skb+0x89/0xb0
        [<ffffffff813eb357>] rtnetlink_rcv+0x27/0x40
        [<ffffffff81403e20>] netlink_unicast+0x140/0x180
        [<ffffffff81404a9e>] netlink_sendmsg+0x33e/0x380
        [<ffffffff813c4252>] sock_sendmsg+0x112/0x130
        [<ffffffff813c537e>] __sys_sendmsg+0x44e/0x460
        [<ffffffff813c5544>] sys_sendmsg+0x44/0x70
        [<ffffffff814feab9>] system_call_fastpath+0x16/0x1b

 -> #0 (&ndev->lock){+.+...}:
        [<ffffffff810a798e>] check_prev_add+0x3de/0x440
        [<ffffffff810a8027>] validate_chain+0x637/0x730
        [<ffffffff810a8417>] __lock_acquire+0x2f7/0x500
        [<ffffffff810a8734>] lock_acquire+0x114/0x150
        [<ffffffff814f6c82>] rt_read_lock+0x42/0x60
        [<ffffffff8147f804>] ipv6_get_lladdr+0x74/0x120
        [<ffffffff8149b036>] mld_newpack+0xb6/0x160
        [<ffffffff8149b18b>] add_grhead+0xab/0xc0
        [<ffffffff8149d03b>] add_grec+0x3ab/0x460
        [<ffffffff8149d14a>] mld_send_report+0x5a/0x150
        [<ffffffff8149f99e>] igmp6_timer_handler+0x4e/0xb0
        [<ffffffff8105705a>] call_timer_fn+0xca/0x1d0
        [<ffffffff81057b9f>] run_timer_softirq+0x1df/0x2e0
        [<ffffffff8104e8c7>] handle_pending_softirqs+0xf7/0x1f0
        [<ffffffff8104ea3b>] __do_softirq_common+0x7b/0xf0
        [<ffffffff8104f07f>] __thread_do_softirq+0x1af/0x210
        [<ffffffff8104f1c1>] run_ksoftirqd+0xe1/0x1f0
        [<ffffffff8106c7de>] kthread+0xae/0xc0
        [<ffffffff814fff74>] kernel_thread_helper+0x4/0x10

actually we can just hold idev->lock before taking pmc->mca_lock,
and avoid taking idev->lock again when iterating idev->addr_list,
since the upper callers of mld_newpack() already take
read_lock_bh(&idev->lock).

Reported-by: dingtianhong <dingtianhong@huawei.com>
Cc: dingtianhong <dingtianhong@huawei.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Tested-by: Ding Tianhong <dingtianhong@huawei.com>
Tested-by: Chen Weilong <chenweilong@huawei.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01 23:39:21 -07:00
Nicolas Dichtel
52bd4c0c15 ipv6: fix ecmp lookup when oif is specified
There is no reason to skip ECMP lookup when oif is specified, but this implies
to check oif given by user when selecting another route.
When the new route does not match oif requirement, we simply keep the initial
one.

Spotted-by: dingzhi <zhi.ding@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01 13:27:38 -07:00