2005-12-27 02:43:12 -02:00
|
|
|
/*
|
|
|
|
* INET An implementation of the TCP/IP protocol suite for the LINUX
|
|
|
|
* operating system. INET is implemented using the BSD Socket
|
|
|
|
* interface as the means of communication with the user level.
|
|
|
|
*
|
|
|
|
* Definitions for inet_sock
|
|
|
|
*
|
|
|
|
* Authors: Many, reorganised here by
|
|
|
|
* Arnaldo Carvalho de Melo <acme@mandriva.com>
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public License
|
|
|
|
* as published by the Free Software Foundation; either version
|
|
|
|
* 2 of the License, or (at your option) any later version.
|
|
|
|
*/
|
|
|
|
#ifndef _INET_SOCK_H
|
|
|
|
#define _INET_SOCK_H
|
|
|
|
|
|
|
|
|
2008-09-09 06:43:12 +02:00
|
|
|
#include <linux/kmemcheck.h>
|
2005-12-27 02:43:12 -02:00
|
|
|
#include <linux/string.h>
|
|
|
|
#include <linux/types.h>
|
2007-03-23 11:40:27 -07:00
|
|
|
#include <linux/jhash.h>
|
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
The complication with this simple approach is that it could
allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets;
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow_table and the rps_dev_flow_table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow_table) does not match the current CPU (found in the
rps_dev_flow_table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU).
- The current CPU is offline.
- The current CPU's queue head counter >= the queue tail counter saved in the
  rps_dev_flow_table entry. This checks whether the queue head has advanced
  beyond the last packet that was enqueued using this table entry, which
  guarantees that all packets queued using this entry have been
  dequeued, thus preserving in-order delivery.
Making each queue have its own rps_dev_flow_table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality;
2) this allows lockless access to the table. The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow_table
for the rxqueue. Both are rounded to a power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of the netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS:             104K tps at 30% CPU
  No RFS (best RPS config):  290K tps at 63% CPU
  RFS:                       303K tps at 61% CPU
RPC test     tps    CPU%   50/90/99% usec latency   Latency StdDev
No RFS/RPS   103K   48%    757/900/3185             4472.35
RPS only     174K   73%    415/993/2468             491.66
RFS          223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
|
|
|
#include <linux/netdevice.h>
|
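The CPU-selection rule described in the RFS commit message above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel's get_rps_cpu(): the types `struct backlog` and `struct dev_flow`, the helpers `flow_enqueue`, `backlog_dequeue` and `select_cpu`, and the `NO_CPU` value are all hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU   0xffffu  /* stand-in for RPS_NO_CPU; this value is an assumption */
#define MAX_CPUS 8

/* Hypothetical per-CPU backlog queue state. */
struct backlog {
    uint32_t head;    /* incremented on every dequeue */
    uint32_t len;     /* packets currently queued */
    bool     online;
};

/* Hypothetical per-RX-queue flow entry: "current" CPU plus the saved tail. */
struct dev_flow {
    uint16_t cpu;
    uint32_t last_qtail;
};

/* Enqueue one packet on the flow's current CPU and remember the queue tail. */
static void flow_enqueue(struct dev_flow *f, struct backlog *q)
{
    struct backlog *b = &q[f->cpu];

    b->len++;
    f->last_qtail = b->head + b->len;   /* tail = head + length */
}

/* Dequeue one packet from a backlog queue. */
static void backlog_dequeue(struct backlog *b)
{
    b->len--;
    b->head++;
}

/* Return the CPU to steer to, switching to `desired` only when it is safe. */
static uint16_t select_cpu(struct dev_flow *f, uint16_t desired,
                           const struct backlog *q)
{
    if (desired != NO_CPU && desired != f->cpu) {
        bool may_switch =
            f->cpu == NO_CPU ||                 /* current CPU unset */
            !q[f->cpu].online ||                /* current CPU offline */
            /* head >= saved tail, written to survive counter wraparound */
            (int32_t)(q[f->cpu].head - f->last_qtail) >= 0;

        if (may_switch)
            f->cpu = desired;
    }
    return f->cpu;
}
```

In this model a flow with a packet still pending on its current CPU stays there, and it moves to the desired CPU only once that packet has been dequeued, which is the in-order guarantee the commit message describes.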
2005-12-27 02:43:12 -02:00
|
|
|
|
|
|
|
#include <net/flow.h>
|
|
|
|
#include <net/sock.h>
|
|
|
|
#include <net/request_sock.h>
|
2008-06-16 17:14:11 -07:00
|
|
|
#include <net/netns/hash.h>
|
2005-12-27 02:43:12 -02:00
|
|
|
|
|
|
|
/** struct ip_options - IP Options
|
|
|
|
*
|
|
|
|
* @faddr - Saved first hop address
|
2011-11-22 23:33:10 +00:00
|
|
|
* @nexthop - Saved nexthop address in LSRR and SSRR
|
2005-12-27 02:43:12 -02:00
|
|
|
* @is_data - Options in __data, rather than skb
|
|
|
|
* @is_strictroute - Strict source route
|
|
|
|
* @srr_is_hit - Packet destination addr was one of ours
|
|
|
|
* @is_changed - IP checksum no longer valid
|
|
|
|
* @rr_needaddr - Need to record addr of outgoing dev
|
|
|
|
* @ts_needtime - Need to record timestamp
|
|
|
|
* @ts_needaddr - Need to record addr of outgoing dev
|
|
|
|
*/
|
|
|
|
struct ip_options {
|
2006-09-27 18:28:07 -07:00
|
|
|
__be32 faddr;
|
2011-11-22 23:33:10 +00:00
|
|
|
__be32 nexthop;
|
2005-12-27 02:43:12 -02:00
|
|
|
unsigned char optlen;
|
|
|
|
unsigned char srr;
|
|
|
|
unsigned char rr;
|
|
|
|
unsigned char ts;
|
2008-03-22 16:35:29 -07:00
|
|
|
unsigned char is_strictroute:1,
|
2005-12-27 02:43:12 -02:00
|
|
|
srr_is_hit:1,
|
|
|
|
is_changed:1,
|
|
|
|
rr_needaddr:1,
|
|
|
|
ts_needtime:1,
|
|
|
|
ts_needaddr:1;
|
|
|
|
unsigned char router_alert;
|
2006-08-03 16:46:20 -07:00
|
|
|
unsigned char cipso;
|
2005-12-27 02:43:12 -02:00
|
|
|
unsigned char __pad2;
|
|
|
|
unsigned char __data[0];
|
|
|
|
};
|
|
|
|
|
2011-04-21 09:45:37 +00:00
|
|
|
struct ip_options_rcu {
|
|
|
|
struct rcu_head rcu;
|
|
|
|
struct ip_options opt;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct ip_options_data {
|
|
|
|
struct ip_options_rcu opt;
|
|
|
|
char data[40];
|
|
|
|
};
|
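The relationship between the options header and the fixed 40-byte buffer in struct ip_options_data can be illustrated with a reduced userspace model. The 40 bytes match the maximum IPv4 options length (a 60-byte maximum header minus the 20-byte fixed part). Everything below is a hypothetical simplification: `struct opts_hdr`, `struct opts_storage`, `opts_store` and the locally defined `MAX_IPOPTLEN` are illustration names, and the model writes to the buffer directly instead of going through a trailing zero-length array.

```c
#include <stddef.h>
#include <string.h>

#define MAX_IPOPTLEN 40   /* 60-byte maximum IPv4 header minus the 20-byte fixed part */

/* Simplified stand-in for struct ip_options (only optlen is modelled). */
struct opts_hdr {
    unsigned char optlen;
};

/* Simplified stand-in for struct ip_options_data: header plus fixed storage. */
struct opts_storage {
    struct opts_hdr opt;
    unsigned char   data[MAX_IPOPTLEN];
};

/* Copy raw option bytes into the storage; returns 0 on success, -1 if too long. */
static int opts_store(struct opts_storage *s, const unsigned char *src, size_t len)
{
    if (len > MAX_IPOPTLEN)
        return -1;
    memcpy(s->data, src, len);
    s->opt.optlen = (unsigned char)len;
    return 0;
}
```

Because the storage has a fixed size, an object like this can be declared as an ordinary local variable without a separate allocation for the option bytes.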
2005-12-27 02:43:12 -02:00
|
|
|
|
|
|
|
struct inet_request_sock {
|
|
|
|
struct request_sock req;
|
2011-12-10 09:48:31 +00:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2005-12-27 02:43:12 -02:00
|
|
|
u16 inet6_rsk_offset;
|
|
|
|
#endif
|
2008-10-01 07:46:49 -07:00
|
|
|
__be16 loc_port;
|
2006-09-27 18:27:13 -07:00
|
|
|
__be32 loc_addr;
|
|
|
|
__be32 rmt_addr;
|
2006-09-27 18:35:29 -07:00
|
|
|
__be16 rmt_port;
|
2008-09-09 06:43:12 +02:00
|
|
|
kmemcheck_bitfield_begin(flags);
|
|
|
|
u16 snd_wscale : 4,
|
|
|
|
rcv_wscale : 4,
|
2005-12-27 02:43:12 -02:00
|
|
|
tstamp_ok : 1,
|
|
|
|
sack_ok : 1,
|
|
|
|
wscale_ok : 1,
|
|
|
|
ecn_ok : 1,
|
2008-10-01 07:41:00 -07:00
|
|
|
acked : 1,
|
|
|
|
no_srccheck: 1;
|
2008-09-09 06:43:12 +02:00
|
|
|
kmemcheck_bitfield_end(flags);
|
2011-04-21 09:45:37 +00:00
|
|
|
struct ip_options_rcu *opt;
|
2005-12-27 02:43:12 -02:00
|
|
|
};
|
|
|
|
|
|
|
|
static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
|
|
|
|
{
|
|
|
|
return (struct inet_request_sock *)sk;
|
|
|
|
}
|
|
|
|
|
2011-03-01 02:36:47 +00:00
|
|
|
struct inet_cork {
|
|
|
|
unsigned int flags;
|
2011-05-06 15:02:07 -07:00
|
|
|
__be32 addr;
|
2011-03-01 02:36:47 +00:00
|
|
|
struct ip_options *opt;
|
2011-05-06 15:02:07 -07:00
|
|
|
unsigned int fragsize;
|
2011-03-01 02:36:47 +00:00
|
|
|
int length; /* Total length of all frames */
|
net: use a per task frag allocator
We currently use a per-socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase the probability of coalescing small write() calls into
single segments in skbs still in the write queue (not yet sent).
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
It's also quite inefficient for building TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
the page allocator more than wanted.
This patch adds a per-task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on the loopback device,
and also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG-enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this by splitting a
fragment into sub-fragments, since some arches have PAGE_SIZE=65536.
Successfully tested on various ethernet devices
(ixgbe, igb, bnx2x, tg3, mellanox mlx4).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-23 23:04:42 +00:00
|
|
|
struct dst_entry *dst;
|
2011-03-01 02:36:47 +00:00
|
|
|
u8 tx_flags;
|
|
|
|
};
|
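The fragment arithmetic quoted in the per-task frag allocator commit message (about 16 pages per 64KB skb with 4096-byte pages, 32768-byte order-3 fragments, 8x fewer frags) can be checked with a few lines of C. The helper names `order_bytes` and `frags_needed` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Size in bytes of an order-N allocation for a given base page size. */
static size_t order_bytes(size_t page_size, unsigned int order)
{
    return page_size << order;
}

/* Number of fragments needed to cover `bytes` using fragments of `frag_size`. */
static size_t frags_needed(size_t bytes, size_t frag_size)
{
    return (bytes + frag_size - 1) / frag_size;   /* round up */
}
```

With a 4096-byte page, `order_bytes(4096, 3)` is 32768, and a 65536-byte payload needs 16 order-0 fragments but only 2 order-3 fragments, which is the 8x reduction the commit message mentions.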
|
|
|
|
2011-05-06 15:02:07 -07:00
|
|
|
struct inet_cork_full {
|
|
|
|
struct inet_cork base;
|
|
|
|
struct flowi fl;
|
|
|
|
};
|
|
|
|
|
2005-12-27 02:43:12 -02:00
|
|
|
struct ip_mc_socklist;
|
|
|
|
struct ipv6_pinfo;
|
|
|
|
struct rtable;
|
|
|
|
|
|
|
|
/** struct inet_sock - representation of INET sockets
|
|
|
|
*
|
|
|
|
* @sk - ancestor class
|
|
|
|
* @pinet6 - pointer to IPv6 control block
|
2009-10-15 06:30:45 +00:00
|
|
|
* @inet_daddr - Foreign IPv4 addr
|
|
|
|
* @inet_rcv_saddr - Bound local IPv4 addr
|
|
|
|
* @inet_dport - Destination port
|
|
|
|
* @inet_num - Local port
|
|
|
|
* @inet_saddr - Sending source
|
2005-12-27 02:43:12 -02:00
|
|
|
* @uc_ttl - Unicast TTL
|
2009-10-15 06:30:45 +00:00
|
|
|
* @inet_sport - Source port
|
|
|
|
* @inet_id - ID counter for DF pkts
|
2005-12-27 02:43:12 -02:00
|
|
|
* @tos - TOS
|
|
|
|
* @mc_ttl - Multicasting TTL
|
|
|
|
* @is_icsk - is this an inet_connection_sock?
|
2012-02-08 09:11:07 +00:00
|
|
|
* @uc_index - Unicast outgoing device index
|
2005-12-27 02:43:12 -02:00
|
|
|
* @mc_index - Multicast device index
|
|
|
|
* @mc_list - Group array
|
|
|
|
* @cork - info to build ip hdr on each ip frag while socket is corked
|
|
|
|
*/
|
|
|
|
struct inet_sock {
|
|
|
|
/* sk and pinet6 have to be the first two members of inet_sock */
|
|
|
|
struct sock sk;
|
2011-12-10 09:48:31 +00:00
|
|
|
#if IS_ENABLED(CONFIG_IPV6)
|
2005-12-27 02:43:12 -02:00
|
|
|
struct ipv6_pinfo *pinet6;
|
|
|
|
#endif
|
|
|
|
/* Socket demultiplex comparisons on incoming packets. */
|
net: optimize INET input path further
Follow-up of commit b178bb3dfc30 (net: reorder struct sock fields)
Optimize the INET input path a bit further, by:
1) moving sk_refcnt close to sk_lock.
This reduces the number of dirtied cache lines by one on 64-bit arches (with a
64-byte cache line size).
2) moving inet_daddr & inet_rcv_saddr to the beginning of sk
(same cache line as hash / family / bound_dev_if / nulls_node)
This reduces the number of accessed cache lines in lookups by one, and doesn't
increase the size of inet and timewait socks.
inet and tw sockets now share the same placeholder for these fields.
Before patch:
  offsetof(struct sock, sk_refcnt) = 0x10
  offsetof(struct sock, sk_lock) = 0x40
  offsetof(struct sock, sk_receive_queue) = 0x60
  offsetof(struct inet_sock, inet_daddr) = 0x270
  offsetof(struct inet_sock, inet_rcv_saddr) = 0x274
After patch:
  offsetof(struct sock, sk_refcnt) = 0x44
  offsetof(struct sock, sk_lock) = 0x48
  offsetof(struct sock, sk_receive_queue) = 0x68
  offsetof(struct inet_sock, inet_daddr) = 0x0
  offsetof(struct inet_sock, inet_rcv_saddr) = 0x4
compute_score() (udp or tcp) now uses a single cache line per ignored
item, instead of two.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30 19:04:07 +00:00
|
|
|
#define inet_daddr sk.__sk_common.skc_daddr
|
|
|
|
#define inet_rcv_saddr sk.__sk_common.skc_rcv_saddr
|
|
|
|
|
2009-10-15 06:30:45 +00:00
|
|
|
__be16 inet_dport;
|
|
|
|
__u16 inet_num;
|
|
|
|
__be32 inet_saddr;
|
2005-12-27 02:43:12 -02:00
|
|
|
__s16 uc_ttl;
|
|
|
|
__u16 cmsg_flags;
|
2009-10-15 06:30:45 +00:00
|
|
|
__be16 inet_sport;
|
|
|
|
__u16 inet_id;
|
2010-01-11 16:28:01 -08:00
|
|
|
|
2011-04-21 09:45:37 +00:00
|
|
|
struct ip_options_rcu __rcu *inet_opt;
|
2005-12-27 02:43:12 -02:00
|
|
|
__u8 tos;
|
2010-01-11 16:28:01 -08:00
|
|
|
__u8 min_ttl;
|
2005-12-27 02:43:12 -02:00
|
|
|
__u8 mc_ttl;
|
|
|
|
__u8 pmtudisc;
|
|
|
|
__u8 recverr:1,
|
|
|
|
is_icsk:1,
|
|
|
|
freebind:1,
|
|
|
|
hdrincl:1,
|
2008-10-01 07:30:02 -07:00
|
|
|
mc_loop:1,
|
2009-05-28 07:00:46 +00:00
|
|
|
transparent:1,
|
2010-06-15 01:07:31 +00:00
|
|
|
mc_all:1,
|
|
|
|
nodefrag:1;
|
2012-02-09 09:35:49 +00:00
|
|
|
__u8 rcv_tos;
|
2012-02-08 09:11:07 +00:00
|
|
|
int uc_index;
|
2005-12-27 02:43:12 -02:00
|
|
|
int mc_index;
|
2006-09-26 21:27:35 -07:00
|
|
|
__be32 mc_addr;
|
2012-07-23 16:29:00 -07:00
|
|
|
int rx_dst_ifindex;
|
2010-11-12 05:46:50 +00:00
|
|
|
struct ip_mc_socklist __rcu *mc_list;
|
2011-05-06 15:02:07 -07:00
|
|
|
struct inet_cork_full cork;
|
2005-12-27 02:43:12 -02:00
|
|
|
};
|
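The inet_daddr and inet_rcv_saddr macros in struct inet_sock alias fields that physically live at the start of the embedded struct sock, which is why the commit message reports their offsets as 0x0 and 0x4. A minimal userspace sketch of the same trick follows; `struct common`, `struct base`, `struct derived` and the `d_*` macro names are hypothetical and far smaller than the kernel structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared leading part of a socket. */
struct common {
    uint32_t daddr;
    uint32_t rcv_saddr;
};

/* Hypothetical stand-in for struct sock: the common part comes first. */
struct base {
    struct common c;
    int other_state[32];
};

/* Hypothetical stand-in for struct inet_sock. */
struct derived {
    struct base sk;
/* Field-like names that really refer to members of the embedded base. */
#define d_daddr     sk.c.daddr
#define d_rcv_saddr sk.c.rcv_saddr
    uint16_t dport;
};
```

Code can then write `d.d_daddr` as if the field belonged to `struct derived`, while `offsetof(struct derived, d_daddr)` evaluates to 0 because the storage sits at the very start of the object.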
|
|
|
|
|
|
|
#define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
|
|
|
|
#define IPCORK_ALLFRAG 2 /* always fragment (for ipv6 for now) */
|
|
|
|
|
|
|
|
static inline struct inet_sock *inet_sk(const struct sock *sk)
|
|
|
|
{
|
|
|
|
return (struct inet_sock *)sk;
|
|
|
|
}
|
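inet_sk() is a plain pointer cast, which is only valid because struct sock is the first member of struct inet_sock (the "ancestor class" in the kernel-doc comment). The pattern can be shown with a reduced userspace example; `struct base_sock`, `struct child_sock` and `child_sk` are hypothetical names used for illustration.

```c
/* Hypothetical ancestor type. */
struct base_sock {
    int family;
};

/* Hypothetical descendant type: the ancestor must be the first member. */
struct child_sock {
    struct base_sock sk;
    unsigned short   num;
};

/* Downcast from the ancestor pointer to the descendant, as inet_sk() does. */
static inline struct child_sock *child_sk(const struct base_sock *sk)
{
    return (struct child_sock *)sk;
}
```

C guarantees that a pointer to a structure and a pointer to its first member convert to each other, so `child_sk(&c.sk)` yields `&c` for any `struct child_sock c`. The cast would be wrong for a `struct base_sock` that is not embedded in a `struct child_sock`, which is why the kernel tracks the socket type separately.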
|
|
|
|
|
|
|
static inline void __inet_sk_copy_descendant(struct sock *sk_to,
|
|
|
|
const struct sock *sk_from,
|
|
|
|
const int ancestor_size)
|
|
|
|
{
|
|
|
|
memcpy(inet_sk(sk_to) + 1, inet_sk(sk_from) + 1,
|
|
|
|
sk_from->sk_prot->obj_size - ancestor_size);
|
|
|
|
}
|
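__inet_sk_copy_descendant() copies only the part of an object that follows the ancestor, leaving the ancestor fields of the destination untouched. A simplified userspace sketch of that idea is below; `struct parent`, `struct child` and `copy_descendant` are hypothetical, and the object size is passed explicitly rather than read from a protocol descriptor such as sk_prot->obj_size.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ancestor part. */
struct parent {
    int            id;
    unsigned short port;
};

/* Hypothetical full object: ancestor first, then descendant-specific state. */
struct child {
    struct parent p;
    int           cwnd;
    int           rtt;
};

/* Copy everything after the first ancestor_size bytes from one object to another. */
static void copy_descendant(struct parent *to, const struct parent *from,
                            size_t obj_size, size_t ancestor_size)
{
    memcpy((char *)to + ancestor_size,
           (const char *)from + ancestor_size,
           obj_size - ancestor_size);
}
```

After `copy_descendant(&b.p, &a.p, sizeof(struct child), sizeof(struct parent))`, `b` keeps its own `id` and `port` but takes `cwnd` and `rtt` from `a`.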
2011-12-10 09:48:31 +00:00
|
|
|
#if !(IS_ENABLED(CONFIG_IPV6))
|
2005-12-27 02:43:12 -02:00
|
|
|
static inline void inet_sk_copy_descendant(struct sock *sk_to,
|
|
|
|
const struct sock *sk_from)
|
|
|
|
{
|
|
|
|
__inet_sk_copy_descendant(sk_to, sk_from, sizeof(struct inet_sock));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
extern int inet_sk_rebuild_header(struct sock *sk);
|
|
|
|
|
2007-03-23 11:40:27 -07:00
|
|
|
extern u32 inet_ehash_secret;
|
|
|
|
extern void build_ehash_secret(void);
|
|
|
|
|
2008-06-16 17:13:27 -07:00
|
|
|
static inline unsigned int inet_ehashfn(struct net *net,
|
|
|
|
const __be32 laddr, const __u16 lport,
|
2006-09-27 18:43:33 -07:00
|
|
|
const __be32 faddr, const __be16 fport)
|
2005-12-27 02:43:12 -02:00
|
|
|
{
|
2008-03-04 14:28:41 -08:00
|
|
|
return jhash_3words((__force __u32) laddr,
|
|
|
|
(__force __u32) faddr,
|
2007-03-23 11:40:27 -07:00
|
|
|
((__u32) lport) << 16 | (__force __u32)fport,
|
2008-06-16 17:14:11 -07:00
|
|
|
inet_ehash_secret + net_hash_mix(net));
|
2005-12-27 02:43:12 -02:00
|
|
|
}
|
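The third word passed to jhash_3words() in inet_ehashfn() packs both 16-bit ports into a single 32-bit value. The packing step on its own looks like the sketch below; `pack_ports` is a hypothetical helper, and the sketch ignores byte order (in the kernel function the foreign port is a __be16 in network byte order and the local port is in host order).

```c
#include <stdint.h>

/* Place the local port in the upper 16 bits and the foreign port in the lower 16. */
static uint32_t pack_ports(uint16_t lport, uint16_t fport)
{
    return ((uint32_t)lport << 16) | fport;
}
```

For example, `pack_ports(80, 0x1234)` gives `0x00501234`, so two connections that differ in either port produce different input words for the hash.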
|
|
|
|
|
|
|
static inline int inet_sk_ehashfn(const struct sock *sk)
|
|
|
|
{
|
|
|
|
const struct inet_sock *inet = inet_sk(sk);
|
2009-10-15 06:30:45 +00:00
|
|
|
const __be32 laddr = inet->inet_rcv_saddr;
|
|
|
|
const __u16 lport = inet->inet_num;
|
|
|
|
const __be32 faddr = inet->inet_daddr;
|
|
|
|
const __be16 fport = inet->inet_dport;
|
2008-06-16 17:13:27 -07:00
|
|
|
struct net *net = sock_net(sk);
|
2005-12-27 02:43:12 -02:00
|
|
|
|
2008-06-16 17:13:27 -07:00
|
|
|
return inet_ehashfn(net, laddr, lport, faddr, fport);
|
2005-12-27 02:43:12 -02:00
|
|
|
}
|
|
|
|
|
2008-06-10 12:39:35 -07:00
|
|
|
static inline struct request_sock *inet_reqsk_alloc(struct request_sock_ops *ops)
|
|
|
|
{
|
|
|
|
struct request_sock *req = reqsk_alloc(ops);
|
2008-09-09 06:43:12 +02:00
|
|
|
struct inet_request_sock *ireq = inet_rsk(req);
|
2008-06-10 12:39:35 -07:00
|
|
|
|
2008-09-09 06:43:12 +02:00
|
|
|
if (req != NULL) {
|
|
|
|
kmemcheck_annotate_bitfield(ireq, flags);
|
|
|
|
ireq->opt = NULL;
|
|
|
|
}
|
2008-06-10 12:39:35 -07:00
|
|
|
|
|
|
|
return req;
|
|
|
|
}
|
|
|
|
|
2008-10-01 07:41:00 -07:00
|
|
|
static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
|
|
|
|
{
|
2011-01-27 22:01:53 -08:00
|
|
|
__u8 flags = 0;
|
|
|
|
|
2011-08-07 09:16:09 +00:00
|
|
|
if (inet_sk(sk)->transparent || inet_sk(sk)->hdrincl)
|
2011-01-27 22:01:53 -08:00
|
|
|
flags |= FLOWI_FLAG_ANYSRC;
|
|
|
|
return flags;
|
2008-10-01 07:41:00 -07:00
|
|
|
}
|
|
|
|
|
2005-12-27 02:43:12 -02:00
|
|
|
#endif /* _INET_SOCK_H */
|