Commit Graph

209 Commits

Author SHA1 Message Date
Eric W. Biederman
29a26a5680 netfilter: Pass struct net into the netfilter hooks
Pass a network namespace parameter into the netfilter hooks.  At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
ipv6/ip6_output.c:ip6_xmit()			sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)".  I am documenting them in case something odd
pops up and someone starts trying to track down what happened.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman
f9e4306fd8 arp: Introduce arp_xmit_finish
The function dev_queue_xmit_skb_sk is unncessary and very confusing.
Introduce arp_xmit_finish to remove the need for dev_queue_xmit_skb_sk,
and have arp_xmit_finish call dev_queue_xmit.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:35 -07:00
David Ahern
30bbaa1950 net: Fix up inet_addr_type checks
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:21 -07:00
David S. Miller
5510b3c2a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/s390/net/bpf_jit_comp.c
	drivers/net/ethernet/ti/netcp_ethss.c
	net/bridge/br_multicast.c
	net/ipv4/ip_fragment.c

All four conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 23:52:20 -07:00
Eric Dumazet
11c91ef98f arp: filter NOARP neighbours for SIOCGARP
When arp is off on a device, and ioctl(SIOCGARP) is queried,
a buggy answer is given with MAC address of the device, instead
of the mac address of the destination/gateway.

We filter out NUD_NOARP neighbours for /proc/net/arp,
we must do the same for SIOCGARP ioctl.

Tested:

lpaa23:~# ./arp 10.246.7.190
MAC=00:01:e8:22:cb:1d      // correct answer

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# cat /proc/net/arp   # check arp table is now 'empty'
IP address       HW type     Flags       HW address    Mask     Device
lpaa23:~# ./arp 10.246.7.190
MAC=00:1a:11:c3:0d:7f   // buggy answer before patch (this is eth0 mac)

After patch :

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# ./arp 10.246.7.190
ioctl(SIOCGARP) failed: No such device or address

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Vytautas Valancius <valas@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-28 23:41:24 -07:00
Thomas Graf
0accfc268f arp: Inherit metadata dst when creating ARP requests
If output device wants to see the dst, inherit the dst of the
original skb and pass it on to generate the ARP request.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:05 -07:00
David Miller
7026b1ddb6 netfilter: Pass socket pointer down through okfn().
On the output paths in particular, we have to sometimes deal with two
socket contexts.  First, and usually skb->sk, is the local socket that
generated the frame.

And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.

We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.

The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device.  We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 15:25:55 -04:00
Ian Morris
00db41243e ipv4: coding style: comparison for inequality with NULL
The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03 12:11:15 -04:00
Ian Morris
51456b2914 ipv4: coding style: comparison for equality with NULL
The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03 12:11:15 -04:00
Eric W. Biederman
60395a20ff neigh: Factor out ___neigh_lookup_noref
While looking at the mpls code I found myself writing yet another
version of neigh_lookup_noref.  We currently have __ipv4_lookup_noref
and __ipv6_lookup_noref.

So to make my work a little easier and to make it a smidge easier to
verify/maintain the mpls code in the future I stopped and wrote
___neigh_lookup_noref.  Then I rewote __ipv4_lookup_noref and
__ipv6_lookup_noref in terms of this new function.  I tested my new
version by verifying that the same code is generated in
ip_finish_output2 and ip6_finish_output2 where these functions are
inlined.

To get to ___neigh_lookup_noref I added a new neighbour cache table
function key_eq.  So that the static size of the key would be
available.

I also added __neigh_lookup_noref for people who want to to lookup
a neighbour table entry quickly but don't know which neibhgour table
they are going to look up.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 00:23:23 -05:00
Eric W. Biederman
bdf53c5849 neigh: Don't require dst in neigh_hh_init
- Add protocol to neigh_tbl so that dst->ops->protocol is not needed
- Acquire the device from neigh->dev

This results in a neigh_hh_init that will cache the samve values
regardless of the packets flowing through it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 16:43:41 -05:00
Eric W. Biederman
59b2af26b9 arp: Kill arp_find
There are no more callers so kill this function.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 16:43:41 -05:00
Eric W. Biederman
21bfb8e933 arp: Remove special case to give AX25 it's open arp operations.
The special case has been pushed out into ax25_neigh_construct so there
is no need to keep this code in arp.c

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 16:43:40 -05:00
WANG Cong
d7480fd3b1 neigh: remove dynamic neigh table registration support
Currently there are only three neigh tables in the whole kernel:
arp table, ndisc table and decnet neigh table. What's more,
we don't support registering multiple tables per family.
Therefore we can just make these tables statically built-in.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 15:23:54 -05:00
Rick Jones
825bae5d97 arp: Do not perturb drop profiles with ignored ARP packets
We do not wish to disturb dropwatch or perf drop profiles with an ARP
we will ignore.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28 17:30:35 -04:00
Salam Noureddine
56022a8fdd ipv4: arp: update neighbour address when a gratuitous arp is received and arp_accept is set
Gratuitous arp packets are useful in switchover scenarios to update
client arp tables as quickly as possible. Currently, the mac address
of a neighbour is only updated after a locktime period has elapsed
since the last update. In most use cases such delays are unacceptable
for network admins. Moreover, the "updated" field of the neighbour
stucture doesn't record the last time the address of a neighbour
changed but records any change that happens to the neighbour. This is
clearly a bug since locktime uses that field as meaning "addr_updated".
With this observation, I was able to perpetuate a stale address by
sending a stream of gratuitous arp packets spaced less than locktime
apart. With this change the address is updated when a gratuitous arp
is received and the arp_accept sysctl is set.

Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 00:08:38 -05:00
Stephen Hemminger
7195cf7221 arp: make arp_invalidate static
Don't export arp_invalidate, only used in arp.c

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-28 17:02:46 -05:00
Nicolas Dichtel
b601fa197f ipv4: fix wildcard search with inet_confirm_addr()
Help of this function says: "in_dev: only on this interface, 0=any interface",
but since commit 39a6d06300 ("[NETNS]: Process inet_confirm_addr in the
correct namespace."), the code supposes that it will never be NULL. This
function is never called with in_dev == NULL, but it's exported and may be used
by an external module.

Because this patch restore the ability to call inet_confirm_addr() with in_dev
== NULL, I partially revert the above commit, as suggested by Julian.

CC: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 14:47:40 -05:00
Jiri Pirko
73af614aed neigh: use tbl->family to distinguish ipv4 from ipv6
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:56:12 -05:00
Jiri Pirko
1f9248e560 neigh: convert parms to an array
This patch converts the neigh param members to an array. This allows easier
manipulation which will be needed later on to provide better management of
default values.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:56:12 -05:00
Tim Gardner
3e25c65ed0 net: neighbour: Remove CONFIG_ARPD
This config option is superfluous in that it only guards a call
to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
change in behavior. There will now be call to __neigh_notify()
for each ARP resolution, which has no impact unless there is a
user space daemon waiting to receive the notification, i.e.,
the case for which CONFIG_ARPD was designed anyways.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Joe Perches <joe@perches.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03 21:41:43 -04:00
Timo Teräs
6c8b4e3ff8 arp: flush arp cache on IFF_NOARP change
IFF_NOARP affects what kind of neighbor entries are created
(nud NOARP or nud INCOMPLETE). If the flag changes, flush the arp
cache to refresh all entries.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: shortened notifier_info struct name
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 13:11:02 -07:00
Jiri Pirko
351638e7de net: pass info struct via netdevice notifier
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: fix typo on simeth
	shortened dev_getter
	shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 13:11:01 -07:00
YOSHIFUJI Hideaki / 吉藤英明
6752c8db8e firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection.
Inspection of upper layer protocol is considered harmful, especially
if it is about ARP or other stateful upper layer protocol; driver
cannot (and should not) have full state of them.

IPv4 over Firewire module used to inspect ARP (both in sending path
and in receiving path), and record peer's GUID, max packet size, max
speed and fifo address.  This patch removes such inspection by extending
our "hardware address" definition to include other information as well:
max packet size, max speed and fifo.  By doing this, The neighbour
module in networking subsystem can cache them.

Note: As we have started ignoring sspd and max_rec in ARP/NDP, those
      information will not be used in the driver when sending.

When a packet is being sent, the IP layer fills our pseudo header with
the extended "hardware address", including GUID and fifo.  The driver
can look-up node-id (the real but rather volatile low-level address)
by GUID, and then the module can send the packet to the wire using
parameters provided in the extendedn hardware address.

This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
over IEEE1394 (RFC3146) share same "hardware address" format
in their address resolution protocols.

Here, extended "hardware address" is defined as follows:

union fwnet_hwaddr {
	u8 u[16];
	struct {
		__be64 uniq_id;		/* EUI-64			*/
		u8 max_rec;		/* max packet size		*/
		u8 sspd;		/* max speed			*/
		__be16 fifo_hi;		/* hi 16bits of FIFO addr	*/
		__be32 fifo_lo;		/* lo 32bits of FIFO addr	*/
	} __packed uc;
};

Note that Hardware address is declared as union, so that we can map full
IP address into this, when implementing MCAP (Multicast Cannel Allocation
Protocol) for IPv6, but IP and ARP subsystem do not need to know this
format in detail.

One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
is that 1394 ARP Request/Reply do not contain the target hardware address
field (aka ar$tha).  This difference is handled in the ARP subsystem.

CC: Stephan Gatzka <stephan.gatzka@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:32:13 -04:00
Gao feng
ece31ffd53 net: proc: change proc_net_remove to remove_proc_entry
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Gao feng
d4beaa66ad net: proc: change proc_net_fops_create to proc_create
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Eric Dumazet
044453b3ef arp: fix possible crash in arp_rcv()
We should call skb_share_check() before pskb_may_pull(), or we
can crash in pskb_expand_head()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-10 20:39:39 -05:00
Cong Wang
cf0be88057 arp: fix a regression in arp_solicit()
Sedat reported the following commit caused a regression:

commit 9650388b5c
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Dec 21 07:32:10 2012 +0000

    ipv4: arp: fix a lockdep splat in arp_solicit

This is due to the 6th parameter of arp_send() needs to be NULL
for the broadcast case, the above commit changed it to an all-zero
array by mistake.

Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-24 18:42:58 -08:00
Eric Dumazet
9650388b5c ipv4: arp: fix a lockdep splat in arp_solicit()
Yan Burman reported following lockdep warning :

=============================================
[ INFO: possible recursive locking detected ]
3.7.0+ #24 Not tainted
---------------------------------------------
swapper/1/0 is trying to acquire lock:
  (&n->lock){++--..}, at: [<ffffffff8139f56e>] __neigh_event_send
+0x2e/0x2f0

but task is already holding lock:
  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&n->lock);
   lock(&n->lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

4 locks held by swapper/1/0:
  #0:  (((&n->timer))){+.-...}, at: [<ffffffff8104b350>]
call_timer_fn+0x0/0x1c0
  #1:  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit
+0x1d4/0x280
  #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81395400>]
dev_queue_xmit+0x0/0x5d0
  #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff813cb41e>]
ip_finish_output+0x13e/0x640

stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
Call Trace:
  <IRQ>  [<ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0
  [<ffffffff8108d570>] __lock_acquire+0x440/0xc30
  [<ffffffff813c3570>] ? inet_getpeer+0x40/0x600
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8108ddf5>] lock_acquire+0x95/0x140
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
  [<ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270
  [<ffffffff813cb62d>] ip_finish_output+0x34d/0x640
  [<ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640
  [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
  [<ffffffff813cb9a0>] ip_output+0x80/0xf0
  [<ffffffff813ca368>] ip_local_out+0x28/0x80
  [<ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan]
  [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
  [<ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0
  [<ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80
  [<ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270
  [<ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0
  [<ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0
  [<ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0
  [<ffffffff813f5788>] arp_xmit+0x58/0x60
  [<ffffffff813f59db>] arp_send+0x3b/0x40
  [<ffffffff813f6424>] arp_solicit+0x204/0x280
  [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
  [<ffffffff8139f515>] neigh_probe+0x45/0x70
  [<ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0
  [<ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0
  [<ffffffff8104b350>] ? detach_if_pending+0x120/0x120
  [<ffffffff8104b748>] run_timer_softirq+0x238/0x2b0
  [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
  [<ffffffff81043e51>] __do_softirq+0x101/0x280
  [<ffffffff814518cc>] call_softirq+0x1c/0x30
  [<ffffffff81003b65>] do_softirq+0x85/0xc0
  [<ffffffff81043a7e>] irq_exit+0x9e/0xc0
  [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
  [<ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80
  <EOI>  [<ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0
  [<ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0
  [<ffffffff8100a6a9>] cpu_idle+0x89/0xe0
  [<ffffffff81441127>] start_secondary+0x1b2/0x1b6

Bug is from arp_solicit(), releasing the neigh lock after arp_send()
In case of vxlan, we eventually need to write lock a neigh lock later.

Its a false positive, but we can get rid of it without lockdep
annotations.

We can instead use neigh_ha_snapshot() helper.

Reported-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-21 13:14:07 -08:00
Eric W. Biederman
52e804c6df net: Allow userns root to control ipv4
Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.

Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.

In general policy and network stack state changes are allowed
while resource control is left unchanged.

Allow creating raw sockets.
Allow the SIOCSARP ioctl to control the arp cache.
Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting gre tunnels.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipip tunnels.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipsec virtual tunnel interfaces.

Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
sockets.

Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
arbitrary ip options.

Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
Allow setting the IP_TRANSPARENT ipv4 socket option.
Allow setting the TCP_REPAIR socket option.
Allow setting the TCP_CONGESTION socket option.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:32:45 -05:00
Nicolas Dichtel
bafa6d9d89 ipv4/route: arg delay is useless in rt_cache_flush()
Since route cache deletion (89aef8921b), delay is no
more used. Remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 15:44:34 -04:00
David S. Miller
c6cffba4ff ipv4: Fix input route performance regression.
With the routing cache removal we lost the "noref" code paths on
input, and this can kill some routing workloads.

Reinstate the noref path when we hit a cached route in the FIB
nexthops.

With help from Eric Dumazet.

Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-26 15:50:39 -07:00
David S. Miller
f8126f1d51 ipv4: Adjust semantics of rt->rt_gateway.
In order to allow prefixed routes, we have to adjust how rt_gateway
is set and interpreted.

The new interpretation is:

1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

2) rt_gateway != 0, destination requires a nexthop gateway

Abstract the fetching of the proper nexthop value using a new
inline helper, rt_nexthop(), as suggested by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
2012-07-20 13:31:20 -07:00
David Miller
38a424e465 ipv4: Kill ip_route_input_noref().
The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:30:59 -07:00
David S. Miller
c10237e077 Revert "ipv4: tcp: dont cache unconfirmed intput dst"
This reverts commit c074da2810.

This change has several unwanted side effects:

1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
   thus never create a real cached route.

2) All TCP traffic will use DST_NOCACHE and never use the routing
   cache at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27 17:05:06 -07:00
Eric Dumazet
c074da2810 ipv4: tcp: dont cache unconfirmed intput dst
DDOS synflood attacks hit badly IP route cache.

On typical machines, this cache is allowed to hold up to 8 Millions dst
entries, 256 bytes for each, for a total of 2GB of memory.

rt_garbage_collect() triggers and tries to cleanup things.

Eventually route cache is disabled but machine is under fire and might
OOM and crash.

This patch exploits the new TCP early demux, to set a nocache
boolean in case incoming TCP frame is for a not yet ESTABLISHED or
TIMEWAIT socket.

This 'nocache' boolean is then used in case dst entry is not found in
route cache, to create an unhashed dst entry (DST_NOCACHE)

SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDOS synflood attack without polluting its IP route cache.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27 15:34:24 -07:00
Thomas Graf
d0daebc3d6 ipv4: Add interface option to enable routing of 127.0.0.0/8
Routing of 127/8 is tradtionally forbidden, we consider
packets from that address block martian when routing and do
not process corresponding ARP requests.

This is a sane default but renders a huge address space
practically unuseable.

The RFC states that no address within the 127/8 block should
ever appear on any network anywhere but it does not forbid
the use of such addresses outside of the loopback device in
particular. For example to address a pool of virtual guests
behind a load balancer.

This patch adds a new interface option 'route_localnet'
enabling routing of the 127/8 address block and processing
of ARP requests on a specific interface.

Note that for the feature to work, the default local route
covering 127/8 dev lo needs to be removed.

Example:
  $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
  $ ip route del 127.0.0.0/8 dev lo table local
  $ ip addr add 127.1.0.1/16 dev eth0
  $ ip route flush cache

V2: Fix invalid check to auto flush cache (thanks davem)

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-12 15:25:46 -07:00
David S. Miller
c727e7f007 Merge branch 'delete-tokenring' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2012-05-16 01:02:40 -04:00
Joe Perches
91df42bedc net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debug
Use the current debugging style and enable dynamic_debug.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-16 01:01:03 -04:00
Paul Gortmaker
211ed86510 net: delete all instances of special processing for token ring
We are going to delete the Token ring support.  This removes any
special processing in the core networking for token ring, (aside
from net/tr.c itself), leaving the drivers and remaining tokenring
support present but inert.

The mass removal of the drivers and net/tr.c will be in a separate
commit, so that the history of these files that we still care
about won't have the giant deletion tied into their history.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-05-15 20:14:35 -04:00
Eric Dumazet
95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
David Howells
9ffc93f203 Remove all #inclusions of asm/system.h
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it.  Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
Neil Horman
124d37e9f0 arp: allow arp processing to honor per interface arp_accept sysctl
I found recently that the arp_process function which handles all of our received
arp frames, is using IPV4_DEVCONF_ALL macro to check the state of the arp_process
flag.  This seems wrong, as it implies that either none or all of the network
interfaces accept gratuitous arps.  This patch corrects that, allowing
per-interface arp_accept configuration to deviate from the all setting.  Note
this also brings us into line with the way the arp_filter setting is handled
during arp_process execution.

Tested this myself on my home network, and confirmed it works as expected.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-16 23:00:20 -07:00
Thomas Graf
70620c46ac net: Don't proxy arp respond if iif == rt->dst.dev if private VLAN is disabled
Commit 653241 (net: RFC3069, private VLAN proxy arp support) changed
the behavior of arp proxy to send arp replies back out on the interface
the request came in even if the private VLAN feature is disabled.

Previously we checked rt->dst.dev != skb->dev for in scenarios, when
proxy arp is enabled on for the netdevice and also when individual proxy
neighbour entries have been added.

This patch adds the check back for the pneigh_lookup() scenario.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-10 15:13:36 -05:00
David S. Miller
2c2aba6c56 ipv6: Use universal hash for NDISC.
In order to perform a proper universal hash on a vector of integers,
we have to use different universal hashes on each vector element.

Which means we need 4 different hash randoms for ipv6.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-28 15:06:58 -05:00
Igor Maravic
40e4783ee6 ipv4: arp: Cleanup in arp.c
Use "IS_ENABLED(CONFIG_FOO)" macro instead of
"defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE)"

Signed-off-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-06 00:34:40 -05:00
David Miller
32092ecf06 atm: clip: Use device neigh support on top of "arp_tbl".
Instead of instantiating an entire new neigh_table instance
just for ATM handling, use the neigh device private facility.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 18:51:03 -05:00
David Miller
76cc714ed5 neigh: Do not set tbl->entry_size in ipv4/ipv6 neigh tables.
Let the core self-size the neigh entry based upon the key length.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 18:46:43 -05:00
Herbert Xu
6608824329 ipv4: Remove all uses of LL_ALLOCATED_SPACE
ipv4: Remove all uses of LL_ALLOCATED_SPACE

The macro LL_ALLOCATED_SPACE was ill-conceived.  It applies the
alignment to the sum of needed_headroom and needed_tailroom.  As
the amount that is then reserved for head room is needed_headroom
with alignment, this means that the tail room left may be too small.

This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv4
with the macro LL_RESERVED_SPACE and direct reference to
needed_tailroom.

This also fixes the problem with needed_headroom changing between
allocating the skb and reserving the head room.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-18 14:37:08 -05:00
Eric Dumazet
8b5c171bb3 neigh: new unresolved queue limits
Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> >
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> >  ...
> >
> > Ok, I've applied this, let's see what happens :-)
>
> Early answer, build fails.
>
> Please test build this patch with DECNET enabled and resubmit.  The
> decnet neigh layer still refers to the removed ->queue_len member.
>
> Thanks.

Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.

[PATCH V5 net-next] neigh: new unresolved queue limits

unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.

$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-14 00:47:54 -05:00
David S. Miller
8f40b161de neigh: Pass neighbour entry to output ops.
This will get us closer to being able to do "neigh stuff"
completely independent of the underlying dst_entry for
protocols (ipv4/ipv6) that wish to do so.

We will also be able to make dst entries neigh-less.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:17 -07:00
David S. Miller
542d4d685f neigh: Kill ndisc_ops->queue_xmit
It is always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 18:30:59 -07:00
David S. Miller
47ec132a40 neigh: Kill neigh_ops->hh_output
It's always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 17:39:57 -07:00
David Miller
3769cffb1c ipv4: Inline neigh binding.
Get rid of all of the useless and costly indirection
by doing the neigh hash table lookup directly inside
of the neighbour binding.

Rename from arp_bind_neighbour to rt_bind_neighbour.

Use new helpers {__,}ipv4_neigh_lookup()

In rt_bind_neighbour() get rid of useless tests which
are never true in the context this function is called,
namely dev is never NULL and the dst->neighbour is
always NULL.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-13 01:12:28 -07:00
David S. Miller
f610b74b14 ipv4: Use universal hash for ARP.
We need to make sure the multiplier is odd.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-11 01:37:28 -07:00
Timo Teräs
93ca3bb5df net: gre: provide multicast mappings for ipv4 and ipv6
My commit 6d55cb91a0 (gre: fix hard header destination
address checking) broke multicast.

The reason is that ip_gre used to get ipgre_header() calls with
zero destination if we have NOARP or multicast destination. Instead
the actual target was decided at ipgre_tunnel_xmit() time based on
per-protocol dissection.

Instead of allowing the "abuse" of ->header() calls with invalid
destination, this creates multicast mappings for ip_gre. This also
fixes "ip neigh show nud noarp" to display the proper multicast
mappings used by the gre device.

Reported-by: Doug Kehn <rdkehn@yahoo.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Doug Kehn <rdkehn@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-30 00:10:47 -07:00
David S. Miller
78fbfd8a65 ipv4: Create and use route lookup helpers.
The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:42 -08:00
David S. Miller
b23dd4fe42 ipv4: Make output route lookup return rtable directly.
Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02 14:31:35 -08:00
Eric Dumazet
c506653d35 net: arp_ioctl() must hold RTNL
Commit 941666c2e3 "net: RCU conversion of dev_getbyhwaddr() and
arp_ioctl()" introduced a regression, reported by Jamie Heilman.
"arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert
in pneigh_lookup()

Removing RTNL requirement from arp_ioctl() was a mistake, just revert
that part.

Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 13:16:16 -08:00
Maxim Levitsky
545ecdc3b3 arp: allow to invalidate specific ARP entries
IPv4 over firewire needs to be able to remove ARP entries
from the ARP cache that belong to nodes that are removed, because
IPv4 over firewire uses ARP packets for private information
about nodes.

This information becomes invalid as soon as node drops
off the bus and when it reconnects, its only possible
to start talking to it after it responded to an ARP packet.
But ARP cache prevents such packets from being sent.

Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-10 16:10:37 -08:00
Eric Dumazet
941666c2e3 net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()
Le dimanche 05 décembre 2010 à 09:19 +0100, Eric Dumazet a écrit :

> Hmm..
>
> If somebody can explain why RTNL is held in arp_ioctl() (and therefore
> in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so
> that your patch can be applied.
>
> Right now it is not good, because RTNL wont be necessarly held when you
> are going to call arp_invalidate() ?

While doing this analysis, I found a refcount bug in llc, I'll send a
patch for net-2.6

Meanwhile, here is the patch for net-next-2.6

Your patch then can be applied after mine.

Thanks

[PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()

dev_getbyhwaddr() was called under RTNL.

Rename it to dev_getbyhwaddr_rcu() and change all its caller to now use
RCU locking instead of RTNL.

Change arp_ioctl() to use RCU instead of RTNL locking.

Note: this fix a dev refcount bug in llc

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 10:07:24 -08:00
Changli Gao
ae9c416d68 net: arp: use assignment
Only when dont_send is 0, arp_filter() is consulted, so we can simply
assign the return value of arp_filter() to dont_send instead.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-06 12:59:09 -08:00
Changli Gao
5811662b15 net: use the macros defined for the members of flowi
Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-17 12:27:45 -08:00
Eric Dumazet
0ed8ddf404 neigh: Protect neigh->ha[] with a seqlock
Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
dirtying neighbour in stress situation (many different flows / dsts)

Dirtying takes place because of read_lock(&n->lock) and n->used writes.

Switching to a seqlock, and writing n->used only on jiffies changes
permits less dirtying.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-11 12:54:04 -07:00
Eric Dumazet
d6bf781712 net neigh: RCU conversion of neigh hash table
David

This is the first step for RCU conversion of neigh code.

Next patches will convert hash_buckets[] and "struct neighbour" to RCU
protected objects.

Thanks

[PATCH net-next] net neigh: RCU conversion of neigh hash table

Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
neigh_table", a new structure is defined :

struct neigh_hash_table {
       struct neighbour        **hash_buckets;
       unsigned int            hash_mask;
       __u32                   hash_rnd;
       struct rcu_head         rcu;
};

And "struct neigh_table" has an RCU protected pointer to such a
neigh_hash_table.

This means the signature of (*hash)() function changed: We need to add a
third parameter with the actual hash_rnd value, since this is not
anymore a neigh_table field.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 14:54:36 -07:00
stephen hemminger
a64de47c09 arp: remove unnecessary export of arp_broken_ops
arp_broken_ops is only used in arp.c

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-29 19:45:35 -07:00
Eric Dumazet
a02cec2155 net: return operator cleanup
Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-23 14:33:39 -07:00
Changli Gao
deffd77759 net: arp: code cleanup
Clean the code up according to Documentation/CodingStyle.

Don't initialize the variable dont_send in arp_process().

Remove the temporary varialbe flags in arp_state_to_flags().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 10:12:06 -07:00
Eric Dumazet
4bc2f18ba4 net/ipv4: EXPORT_SYMBOL cleanups
CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12 12:57:54 -07:00
Eric Dumazet
4b4194c40f arp: RCU change in arp_solicit()
Avoid two atomic ops in arp_solicit()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 21:33:16 -07:00
Changli Gao
d8d1f30b95 net-next: remove useless union keyword
remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-10 23:31:35 -07:00
Eric Dumazet
faa9dcf793 arp: RCU changes
Avoid two atomic ops in arp_fwd_proxy()

Avoid two atomic ops in arp_process()

Valid optims since arp_rcv() is run under rcu_read_lock()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-03 20:01:52 -07:00
Eric Dumazet
4a94445c9a net: Use ip_route_input_noref() in input path
Use ip_route_input_noref() in ip fast path, to avoid two atomic ops per
incoming packet.

Note: loopback is excluded from this optimization in ip_rcv_finish()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17 17:18:51 -07:00
David S. Miller
f0ecde1466 net: Fix FDDI and TR config checks in ipv4 arp and LLC.
Need to check both CONFIG_FOO and CONFIG_FOO_MODULE

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-10 04:59:07 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Eric W. Biederman
54716e3beb net neigh: Decouple per interface neighbour table controls from binary sysctls
Stop computing the number of neighbour table settings we have by
counting the number of binary sysctls.  This behaviour was silly
and meant that we could not add another neighbour table setting
without also adding another binary sysctl.

Don't pass the binary sysctl path for neighour table entries
into neigh_sysctl_register.  These parameters are no longer
used and so are just dead code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-16 15:55:18 -08:00
Octavian Purdila
6d955180b2 ipv4: allow warming up the ARP cache with request type gratuitous ARP
If the per device ARP_ACCEPT option is enable, currently we only allow
creating new ARP cache entries for response type gratuitous ARP.

Allowing gratuitous ARP to create new ARP entries (not only to update
existing ones) is useful when we want to avoid unnecessary delays for
the first packet of a stream.

This patch allows request type gratuitous ARP to create new ARP cache
entries as well. This is useful when we want to populate the ARP cache
entries for a large number of hosts on the same LAN.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-19 02:12:34 -08:00
Jesper Dangaard Brouer
65324144b5 net: RFC3069, private VLAN proxy arp support
This is to be used together with switch technologies, like RFC3069,
that where the individual ports are not allowed to communicate with
each other, but they are allowed to talk to the upstream router.  As
described in RFC 3069, it is possible to allow these hosts to
communicate through the upstream router by proxy_arp'ing.

This patch basically allow proxy arp replies back to the same
interface (from which the ARP request/solicitation was received).

Tunable per device via proc "proxy_arp_pvlan":
  /proc/sys/net/ipv4/conf/*/proxy_arp_pvlan

This switch technology is known by different vendor names:
 - In RFC 3069 it is called VLAN Aggregation.
 - Cisco and Allied Telesyn call it Private VLAN.
 - Hewlett-Packard call it Source-Port filtering or port-isolation.
 - Ericsson call it MAC-Forced Forwarding (RFC Draft).

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-07 00:59:09 -08:00
Eric W. Biederman
f8572d8f2a sysctl net: Remove unused binary sysctl code
Now that sys_sysctl is a compatiblity wrapper around /proc/sys
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables are unused, and can be
revmoed.

In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and it's callers have been modified not
to pass one.

Cc: "David Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-11-12 02:05:06 -08:00
Stephen Hemminger
89d69d2b75 net: make neigh_ops constant
These tables are never modified at runtime. Move to read-only
section.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-01 17:40:57 -07:00
roel kluin
a3e8ee6820 ipv4: ARP neigh procfs buffer overflow
If arp_format_neigh_entry() can be called with n->dev->addr_len == 0, then a
write to hbuffer[-1] occurs.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-30 13:27:29 -07:00
Eric W. Biederman
f8a68e752b Revert "ipv4: arp announce, arp_proxy and windows ip conflict verification"
This reverts commit 73ce7b01b4.

After discovering that we don't listen to gratuitious arps in 2.6.30
I tracked the failure down to this commit.

The patch makes absolutely no sense.  RFC2131 RFC3927 and RFC5227.
are all in agreement that an arp request with sip == 0 should be used
for the probe (to prevent learning) and an arp request with sip == tip
should be used for the gratitous announcement that people can learn
from.

It appears the author of the broken patch got those two cases confused
and modified the code to drop all gratuitous arp traffic.  Ouch!

Cc: stable@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-30 19:47:08 -07:00
Eric Dumazet
adf30907d6 net: skb->dst accessors
Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:04 -07:00
Eric Dumazet
511c3f92ad net: skb->rtable accessor
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-03 02:51:02 -07:00
Denys Fedoryshchenko
73ce7b01b4 ipv4: arp announce, arp_proxy and windows ip conflict verification
Windows (XP at least) hosts on boot, with configured static ip, performing 
address conflict detection, which is defined in RFC3927.
Here is quote of important information:

"
An ARP announcement is identical to the ARP Probe described above, 
except    that now the sender and target IP addresses are both set 
to the host's newly selected IPv4 address. 
"

But it same time this goes wrong with RFC5227.
"
The 'sender IP address' field MUST be set to all zeroes; this is to avoid
polluting ARP caches in other hosts on the same link in the case
where the address turns out to be already in use by another host.
"

When ARP proxy configured, it must not answer to both cases, because 
it is address conflict verification in any case. For Windows it is just 
causing to detect false "ip conflict". Already there is code for RFC5227, so 
just trivially we just check also if source ip == target ip.

Signed-off-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-13 16:02:07 -07:00
Neil Horman
ead2ceb0ec Network Drop Monitor: Adding kfree_skb_clean for non-drops and modifying end-of-line points for skbs
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

 include/linux/skbuff.h |    4 +++-
 net/core/datagram.c    |    2 +-
 net/core/skbuff.c      |   22 ++++++++++++++++++++++
 net/ipv4/arp.c         |    2 +-
 net/ipv4/udp.c         |    2 +-
 net/packet/af_packet.c |    2 +-
 6 files changed, 29 insertions(+), 5 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-13 12:09:28 -07:00
Stephen Hemminger
7546dd97d2 net: convert usage of packet_type to read_mostly
Protocols that use packet_type can be __read_mostly section for better
locality. Elminate any unnecessary initializations of NULL.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-10 05:22:43 -07:00
Harvey Harrison
09640e6365 net: replace uses of __constant_{endian}
Base versions handle constant folding now.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01 00:45:17 -08:00
Ben Greear
8164f1b797 ipv4: Fix ARP behavior with many mac-vlans
Ben Greear wrote:
> I have 500 mac-vlans on a system talking to 500 other
> mac-vlans.  My problem is that the arp-table gets extremely
> huge because every time an arp-request comes in on all mac-vlans,
> a stale arp entry is added for each mac-vlan.  I have filtering
> turned on, but that doesn't help because the neigh_event_ns call
> below will cause a stale neighbor entry to be created regardless
> of whether a replay will be sent or not.
> Maybe the neigh_event code should be below the checks for dont_send,
> and only create check neigh_event_ns if we are !dont_send?

The attached patch makes it work much better for me.  The patch
will cause the code to NOT create a stale neighbor entry if we
are not going to respond to the ARP request.  The old code
*would* create a stale entry even if we are not going to respond.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 19:19:38 -08:00
Jianjun Kong
f4cca7ffb2 net: clean up net/ipv4/pararp.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 02:48:14 -08:00
Harvey Harrison
673d57e723 net: replace NIPQUAD() in net/ipv4/ net/ipv6/
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:53:57 -07:00
Jan Engelhardt
fdc9314cbe netfilter: replace old NF_ARP calls with NFPROTO_ARP
(Supplements: ee999d8b95)

NFPROTO_ARP actually has a different value from NF_ARP, so ensure all
callers use the new value so that packets _do_ get delivered to the
registered hooks.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-20 03:34:51 -07:00
Pavel Emelyanov
de0744af1f mib: add net to NET_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:31:16 -07:00
Pavel Emelyanov
ca12a1a443 inet: prepare net on the stack for NET accounting macros
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:28:42 -07:00
Denis V. Lunev
76e6ebfb40 netns: add namespace parameter to rt_cache_flush
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-05 19:00:44 -07:00
Adrian Bunk
0b04082995 net: remove CVS keywords
This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-11 21:00:38 -07:00
Denis Cheng
51f82a2b12 net/ipv4/arp.c: Use common hex_asc helpers
Here the local hexbuf is a duplicate of global const char hex_asc from
lib/hexdump.c, except the hex letters' cases:

	const char hexbuf[] = "0123456789ABCDEF";

	const char hex_asc[] = "0123456789abcdef";

and here to print HW addresses, the hex cases are not significant.

Thanks to Harvey Harrison to introduce the hex_asc_hi/hex_asc_lo helpers.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-21 17:34:32 -07:00
Johannes Berg
f5184d267c net: Allow netdevices to specify needed head/tailroom
This patch adds needed_headroom/needed_tailroom members to struct
net_device and updates many places that allocate sbks to use them. Not
all of them can be converted though, and I'm sure I missed some (I
mostly grepped for LL_RESERVED_SPACE)

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-12 20:48:31 -07:00
YOSHIFUJI Hideaki
a7d632b6b4 [IPV4]: Use NIPQUAD_FMT to format ipv4 addresses.
And use %u to format port.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-14 04:09:00 -07:00
YOSHIFUJI Hideaki
c346dca108 [NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.
Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:53 +09:00