License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2007-10-15 02:24:19 -07:00
|
|
|
#ifndef __NET_FRAG_H__
|
|
|
|
#define __NET_FRAG_H__
|
|
|
|
|
2018-06-18 12:52:50 +10:00
|
|
|
#include <linux/rhashtable-types.h>
|
2019-05-27 16:56:49 -07:00
|
|
|
#include <linux/completion.h>
|
2022-07-20 16:57:58 -07:00
|
|
|
#include <linux/in6.h>
|
|
|
|
#include <linux/rbtree_types.h>
|
|
|
|
#include <linux/refcount.h>
|
2023-04-19 14:52:52 +02:00
|
|
|
#include <net/dropreason-core.h>
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
|
2019-05-24 09:03:30 -07:00
|
|
|
/* Per netns frag queues directory */
|
|
|
|
struct fqdir {
|
2008-01-22 06:09:37 -08:00
|
|
|
/* sysctls */
|
2018-03-31 12:58:53 -07:00
|
|
|
long high_thresh;
|
|
|
|
long low_thresh;
|
2008-01-22 06:09:37 -08:00
|
|
|
int timeout;
|
2016-02-15 12:11:31 +02:00
|
|
|
int max_dist;
|
2018-03-31 12:58:44 -07:00
|
|
|
struct inet_frags *f;
|
2019-05-24 09:03:38 -07:00
|
|
|
struct net *net;
|
2019-05-24 09:03:40 -07:00
|
|
|
bool dead;
|
2018-03-31 12:58:57 -07:00
|
|
|
|
|
|
|
struct rhashtable rhashtable ____cacheline_aligned_in_smp;
|
|
|
|
|
|
|
|
/* Keep atomic mem on separate cachelines in structs that include it */
|
|
|
|
atomic_long_t mem ____cacheline_aligned_in_smp;
|
2019-06-18 11:09:00 -07:00
|
|
|
struct work_struct destroy_work;
|
inet: frags: batch fqdir destroy works
On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
make the number of active slab objects including 'sock_inode_cache' type
rapidly and continuously increase. As a result, memory pressure occurs.
In more detail, I made an artificial reproducer that resembles the
workload that we found the problem and reproduce the problem faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes
about 2 minutes. On 40 CPU cores / 70GB DRAM machine, the available
memory continuously reduced in a fast speed (about 120MB per second,
15GB in total within the 2 minutes). Note that the issue don't
reproduce on every machine. On my 6 CPU cores machine, the problem
didn't reproduce.
'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
relevant memory objects. They are asynchronously invoked by the work
queues and internally use 'rcu_barrier()' to ensure safe destructions.
'cleanup_net()' works in a batched maneer in a single thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'. Therefore, 'fqdir_work_fn()' called frequently under the
workload and made the contention for 'rcu_barrier()' high. In more
detail, the global mutex, 'rcu_state.barrier_mutex' became the
bottleneck.
This commit avoids such contention by doing the 'rcu_barrier()' and
subsequent lightweight works in a batched manner, as similar to that of
'cleanup_net()'. The fqdir hashtable destruction, which is done before
the 'rcu_barrier()', is still allowed to run in parallel for fast
processing, but this commit makes it to use a dedicated work queue
instead of the 'system_wq', to make sure that the number of threads is
bounded.
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 12:24:05 +01:00
|
|
|
struct llist_node free_list;
|
2008-01-22 06:02:14 -08:00
|
|
|
};
|
|
|
|
|
2014-08-01 12:29:45 +02:00
|
|
|
/**
|
2023-07-13 21:51:23 -07:00
|
|
|
* enum: fragment queue flags
|
2014-08-01 12:29:45 +02:00
|
|
|
*
|
|
|
|
* @INET_FRAG_FIRST_IN: first fragment has arrived
|
|
|
|
* @INET_FRAG_LAST_IN: final fragment has arrived
|
|
|
|
* @INET_FRAG_COMPLETE: frag queue has been processed and is due for destruction
|
2019-05-24 09:03:40 -07:00
|
|
|
* @INET_FRAG_HASH_DEAD: inet_frag_kill() has not removed fq from rhashtable
|
2022-10-29 15:45:19 +00:00
|
|
|
* @INET_FRAG_DROP: if skbs must be dropped (instead of being consumed)
|
2014-08-01 12:29:45 +02:00
|
|
|
*/
|
|
|
|
enum {
|
|
|
|
INET_FRAG_FIRST_IN = BIT(0),
|
|
|
|
INET_FRAG_LAST_IN = BIT(1),
|
|
|
|
INET_FRAG_COMPLETE = BIT(2),
|
2019-05-24 09:03:40 -07:00
|
|
|
INET_FRAG_HASH_DEAD = BIT(3),
|
2022-10-29 15:45:19 +00:00
|
|
|
INET_FRAG_DROP = BIT(4),
|
2014-08-01 12:29:45 +02:00
|
|
|
};
|
|
|
|
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
struct frag_v4_compare_key {
|
|
|
|
__be32 saddr;
|
|
|
|
__be32 daddr;
|
|
|
|
u32 user;
|
|
|
|
u32 vif;
|
|
|
|
__be16 id;
|
|
|
|
u16 protocol;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct frag_v6_compare_key {
|
|
|
|
struct in6_addr saddr;
|
|
|
|
struct in6_addr daddr;
|
|
|
|
u32 user;
|
|
|
|
__be32 id;
|
|
|
|
u32 iif;
|
|
|
|
};
|
|
|
|
|
2014-08-01 12:29:45 +02:00
|
|
|
/**
|
|
|
|
* struct inet_frag_queue - fragment queue
|
|
|
|
*
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
* @node: rhash node
|
|
|
|
* @key: keys identifying this frag.
|
2014-08-01 12:29:45 +02:00
|
|
|
* @timer: queue expiration timer
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
* @lock: spinlock protecting this frag
|
2014-08-01 12:29:45 +02:00
|
|
|
* @refcnt: reference count of the queue
|
2018-08-11 20:27:24 +00:00
|
|
|
* @rb_fragments: received fragments rb-tree root
|
2014-08-01 12:29:45 +02:00
|
|
|
* @fragments_tail: received fragments tail
|
2018-08-11 20:27:24 +00:00
|
|
|
* @last_run_head: the head of the last "run". see ip_fragment.c
|
2014-08-01 12:29:45 +02:00
|
|
|
* @stamp: timestamp of the last received fragment
|
|
|
|
* @len: total length of the original datagram
|
|
|
|
* @meat: length of received fragments so far
|
2024-05-09 14:18:32 -07:00
|
|
|
* @tstamp_type: stamp has a mono delivery time (EDT)
|
2014-08-01 12:29:45 +02:00
|
|
|
* @flags: fragment queue flags
|
2015-05-22 16:32:51 +02:00
|
|
|
* @max_size: maximum received fragment size
|
2019-05-24 09:03:30 -07:00
|
|
|
* @fqdir: pointer to struct fqdir
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
* @rcu: rcu head for freeing deferall
|
2014-08-01 12:29:45 +02:00
|
|
|
*/
|
2007-10-15 02:24:19 -07:00
|
|
|
struct inet_frag_queue {
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
struct rhash_head node;
|
|
|
|
union {
|
|
|
|
struct frag_v4_compare_key v4;
|
|
|
|
struct frag_v6_compare_key v6;
|
|
|
|
} key;
|
2014-08-01 12:29:45 +02:00
|
|
|
struct timer_list timer;
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
spinlock_t lock;
|
2017-06-30 13:08:07 +03:00
|
|
|
refcount_t refcnt;
|
2019-02-25 17:43:46 -08:00
|
|
|
struct rb_root rb_fragments;
|
2010-06-29 04:39:37 +00:00
|
|
|
struct sk_buff *fragments_tail;
|
2018-08-11 20:27:24 +00:00
|
|
|
struct sk_buff *last_run_head;
|
2007-10-15 02:24:19 -07:00
|
|
|
ktime_t stamp;
|
2014-08-01 12:29:45 +02:00
|
|
|
int len;
|
2007-10-15 02:24:19 -07:00
|
|
|
int meat;
|
2024-05-09 14:18:32 -07:00
|
|
|
u8 tstamp_type;
|
2014-08-01 12:29:45 +02:00
|
|
|
__u8 flags;
|
2012-08-26 19:13:55 +02:00
|
|
|
u16 max_size;
|
2019-05-24 09:03:30 -07:00
|
|
|
struct fqdir *fqdir;
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
struct rcu_head rcu;
|
net: frag queue per hash bucket locking
This patch implements per hash bucket locking for the frag queue
hash. This removes two write locks, and the only remaining write
lock is for protecting hash rebuild. This essentially reduce the
readers-writer lock to a rebuild lock.
This patch is part of "net: frag performance followup"
http://thread.gmane.org/gmane.linux.network/263644
of which two patches have already been accepted:
Same test setup as previous:
(http://thread.gmane.org/gmane.linux.network/257155)
Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses
Ethernet flow-control. A third interface is used for generating the
DoS attack (with trafgen).
Notice, I have changed the frag DoS generator script to be more
efficient/deadly. Before it would only hit one RX queue, now its
sending packets causing multi-queue RX, due to "better" RX hashing.
Test types summary (netperf UDP_STREAM):
Test-20G64K == 2x10G with 65K fragments
Test-20G3F == 2x10G with 3x fragments (3*1472 bytes)
Test-20G64K+DoS == Same as 20G64K with frag DoS
Test-20G3F+DoS == Same as 20G3F with frag DoS
Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS
Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS
When I rebased this-patch(03) (on top of net-next commit a210576c) and
removed the _bh spinlock, I saw a performance regression. BUT this
was caused by some unrelated change in-between. See tests below.
Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02.
Test (C) is what I reported before for this-patch
Test (D) is net-next master HEAD (commit a210576c), which reveals some
(unknown) performance regression (compared against test (B)).
Test (D) function as a new base-test.
Performance table summary (in Mbit/s):
(#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ
---------- ------- ------- ---------- --------- -------- -------
(A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2
(B) 1b5ab0de : 18841.5 13156.8 4101.08 5314.57 129.0 424.2
(C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6
(D) a210576c : 18321.5 11250.4 3635.34 5160.13 119.1 405.2
(E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6
(F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3
Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks.
I cannot explain the slow down for 20G64K (but its an artificial
"lab-test" so I'm not worried). But the other results does show
improvements. And test (E) "with _bh" version is slightly better.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
----
V2:
- By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
need the spinlock _bh versions, as Netfilter currently does a
local_bh_disable() before entering inet_fragment.
- Fold-in desc from cover-mail
V3:
- Drop the chain_len counter per hash bucket.
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-03 23:38:16 +00:00
|
|
|
};
|
|
|
|
|
2007-10-15 02:31:52 -07:00
|
|
|
struct inet_frags {
|
2017-05-23 00:20:26 +03:00
|
|
|
unsigned int qsize;
|
2007-10-15 02:38:08 -07:00
|
|
|
|
2007-10-17 19:46:47 -07:00
|
|
|
void (*constructor)(struct inet_frag_queue *q,
|
2014-07-24 16:50:29 +02:00
|
|
|
const void *arg);
|
2007-10-15 02:39:14 -07:00
|
|
|
void (*destructor)(struct inet_frag_queue *);
|
2017-10-16 17:29:20 -07:00
|
|
|
void (*frag_expire)(struct timer_list *t);
|
2014-08-01 12:29:48 +02:00
|
|
|
struct kmem_cache *frags_cachep;
|
|
|
|
const char *frags_cache_name;
|
inet: frags: use rhashtables for reassembly units
Some applications still rely on IP fragmentation, and to be fair linux
reassembly unit is not working under any serious load.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.
Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.
After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)
$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608
A followup patch will change the limits for 64bit arches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 12:58:49 -07:00
|
|
|
struct rhashtable_params rhash_params;
|
2019-05-27 16:56:49 -07:00
|
|
|
refcount_t refcnt;
|
|
|
|
struct completion completion;
|
2007-10-15 02:31:52 -07:00
|
|
|
};
|
|
|
|
|
2014-08-01 12:29:48 +02:00
|
|
|
int inet_frags_init(struct inet_frags *);
|
2007-10-15 02:31:52 -07:00
|
|
|
void inet_frags_fini(struct inet_frags *);
|
|
|
|
|
2019-05-27 16:56:47 -07:00
|
|
|
int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net);
|
2019-06-18 11:09:00 -07:00
|
|
|
|
inet: fix compilation warnings in fqdir_pre_exit()
The linux-next commit "inet: fix various use-after-free in defrags
units" [1] introduced compilation warnings,
./include/net/inet_frag.h:117:1: warning: 'inline' is not at beginning
of declaration [-Wold-style-declaration]
static void inline fqdir_pre_exit(struct fqdir *fqdir)
^~~~~~
In file included from ./include/net/netns/ipv4.h:10,
from ./include/net/net_namespace.h:20,
from ./include/linux/netdevice.h:38,
from ./include/linux/icmpv6.h:13,
from ./include/linux/ipv6.h:86,
from ./include/net/ipv6.h:12,
from ./include/rdma/ib_verbs.h:51,
from ./include/linux/mlx5/device.h:37,
from ./include/linux/mlx5/driver.h:51,
from
drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:37:
[1] https://lore.kernel.org/netdev/20190618180900.88939-3-edumazet@google.com/
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-20 10:52:40 -04:00
|
|
|
static inline void fqdir_pre_exit(struct fqdir *fqdir)
|
2019-06-18 11:09:00 -07:00
|
|
|
{
|
2022-01-13 01:22:29 -08:00
|
|
|
/* Prevent creation of new frags.
|
|
|
|
* Pairs with READ_ONCE() in inet_frag_find().
|
|
|
|
*/
|
|
|
|
WRITE_ONCE(fqdir->high_thresh, 0);
|
|
|
|
|
|
|
|
/* Pairs with READ_ONCE() in inet_frag_kill(), ip_expire()
|
|
|
|
* and ip6frag_expire_frag_queue().
|
|
|
|
*/
|
|
|
|
WRITE_ONCE(fqdir->dead, true);
|
2019-06-18 11:09:00 -07:00
|
|
|
}
|
2019-05-24 09:03:31 -07:00
|
|
|
void fqdir_exit(struct fqdir *fqdir);
|
2008-01-22 06:06:23 -08:00
|
|
|
|
2018-03-31 12:58:44 -07:00
|
|
|
void inet_frag_kill(struct inet_frag_queue *q);
|
|
|
|
void inet_frag_destroy(struct inet_frag_queue *q);
|
2019-05-24 09:03:30 -07:00
|
|
|
struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void *key);
|
2007-10-15 02:37:18 -07:00
|
|
|
|
2018-08-11 20:27:24 +00:00
|
|
|
/* Free all skbs in the queue; return the sum of their truesizes. */
|
2022-10-29 15:45:19 +00:00
|
|
|
unsigned int inet_frag_rbtree_purge(struct rb_root *root,
|
|
|
|
enum skb_drop_reason reason);
|
2018-08-11 20:27:24 +00:00
|
|
|
|
2018-03-31 12:58:44 -07:00
|
|
|
static inline void inet_frag_put(struct inet_frag_queue *q)
|
2007-10-15 02:41:56 -07:00
|
|
|
{
|
2017-06-30 13:08:07 +03:00
|
|
|
if (refcount_dec_and_test(&q->refcnt))
|
2018-03-31 12:58:44 -07:00
|
|
|
inet_frag_destroy(q);
|
2007-10-15 02:41:56 -07:00
|
|
|
}
|
|
|
|
|
2013-01-28 23:45:12 +00:00
|
|
|
/* Memory Tracking Functions. */
|
|
|
|
|
2019-05-24 09:03:30 -07:00
|
|
|
static inline long frag_mem_limit(const struct fqdir *fqdir)
|
2013-01-28 23:45:12 +00:00
|
|
|
{
|
2019-05-24 09:03:30 -07:00
|
|
|
return atomic_long_read(&fqdir->mem);
|
2013-01-28 23:45:12 +00:00
|
|
|
}
|
|
|
|
|
2019-05-24 09:03:30 -07:00
|
|
|
static inline void sub_frag_mem_limit(struct fqdir *fqdir, long val)
|
2013-01-28 23:45:12 +00:00
|
|
|
{
|
2019-05-24 09:03:30 -07:00
|
|
|
atomic_long_sub(val, &fqdir->mem);
|
2013-01-28 23:45:12 +00:00
|
|
|
}
|
|
|
|
|
2019-05-24 09:03:30 -07:00
|
|
|
static inline void add_frag_mem_limit(struct fqdir *fqdir, long val)
|
2013-01-28 23:45:12 +00:00
|
|
|
{
|
2019-05-24 09:03:30 -07:00
|
|
|
atomic_long_add(val, &fqdir->mem);
|
2013-01-28 23:45:12 +00:00
|
|
|
}
|
|
|
|
|
2013-03-22 08:24:37 +00:00
|
|
|
/* RFC 3168 support :
|
|
|
|
* We want to check ECN values of all fragments, do detect invalid combinations.
|
|
|
|
* In ipq->ecn, we store the OR value of each ip4_frag_ecn() fragment value.
|
|
|
|
*/
|
|
|
|
#define IPFRAG_ECN_NOT_ECT 0x01 /* one frag had ECN_NOT_ECT */
|
|
|
|
#define IPFRAG_ECN_ECT_1 0x02 /* one frag had ECN_ECT_1 */
|
|
|
|
#define IPFRAG_ECN_ECT_0 0x04 /* one frag had ECN_ECT_0 */
|
|
|
|
#define IPFRAG_ECN_CE 0x08 /* one frag had ECN_CE */
|
|
|
|
|
|
|
|
extern const u8 ip_frag_ecn_table[16];
|
|
|
|
|
2019-01-22 10:02:50 -08:00
|
|
|
/* Return values of inet_frag_queue_insert() */
|
|
|
|
#define IPFRAG_OK 0
|
|
|
|
#define IPFRAG_DUP 1
|
|
|
|
#define IPFRAG_OVERLAP 2
|
|
|
|
int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
|
|
|
|
int offset, int end);
|
|
|
|
void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
|
|
|
|
struct sk_buff *parent);
|
|
|
|
void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
|
inet: frags: re-introduce skb coalescing for local delivery
Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6
defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
generating many fragments) and running over an IPsec tunnel, reported
more than 6Gbps throughput. After that patch, the same test gets only
9Mbps when receiving on a be2net nic (driver can make a big difference
here, for example, ixgbe doesn't seem to be affected).
By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
(IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert
"ipv4: use skb coalescing in defragmentation"")).
Without fragment coalescing, be2net runs out of Rx ring entries and
starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
the netperf traffic is only composed of UDP fragments, any lost packet
prevents reassembly of the full datagram. Therefore, fragments which
have no possibility to ever get reassembled pile up in the reassembly
queue, until the memory accounting exeeds the threshold. At that point
no fragment is accepted anymore, which effectively discards all
netperf traffic.
When reassembly timeout expires, some stale fragments are removed from
the reassembly queue, so a few packets can be received, reassembled
and delivered to the netperf receiver. But the nic still drops frames
and soon the reassembly queue gets filled again with stale fragments.
These long time frames where no datagram can be received explain why
the performance drop is so significant.
Re-introducing fragment coalescing is enough to get the initial
performances again (6.6Gbps with be2net): driver doesn't drop frames
anymore (no more rx_drops_no_frags errors) and the reassembly engine
works at full speed.
This patch is quite conservative and only coalesces skbs for local
IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
forwarding). Coalescing could be extended in the future if need be, as
more scenarios would probably benefit from it.
[0]: Test configuration
Sender:
ip xfrm policy flush
ip xfrm state flush
ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
netserver -D -L fc00:2::1
Receiver:
ip xfrm policy flush
ip xfrm state flush
ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-02 17:15:03 +02:00
|
|
|
void *reasm_data, bool try_coalesce);
|
2019-01-22 10:02:50 -08:00
|
|
|
struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
|
|
|
|
|
2007-10-15 02:24:19 -07:00
|
|
|
#endif
|