2005-04-16 15:20:36 -07:00
|
|
|
/*
|
|
|
|
* Linux Socket Filter Data Structures
|
|
|
|
*/
|
|
|
|
#ifndef __LINUX_FILTER_H__
|
|
|
|
#define __LINUX_FILTER_H__
|
|
|
|
|
2011-07-26 16:09:06 -07:00
|
|
|
#include <linux/atomic.h>
|
2012-04-12 16:47:53 -05:00
|
|
|
#include <linux/compat.h>
|
2013-10-04 00:14:06 -07:00
|
|
|
#include <linux/workqueue.h>
|
2012-10-13 10:46:48 +01:00
|
|
|
#include <uapi/linux/filter.h>
|
2011-05-22 07:08:11 +00:00
|
|
|
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
/* Internally used and optimized filter representation with extended
|
|
|
|
* instruction set based on top of classic BPF.
|
2012-04-12 16:47:53 -05:00
|
|
|
*/
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
|
|
|
|
/* instruction classes */
|
|
|
|
#define BPF_ALU64 0x07 /* alu mode in double word width */
|
|
|
|
|
|
|
|
/* ld/ldx fields */
|
|
|
|
#define BPF_DW 0x18 /* double word */
|
|
|
|
#define BPF_XADD 0xc0 /* exclusive add */
|
|
|
|
|
|
|
|
/* alu/jmp fields */
|
|
|
|
#define BPF_MOV 0xb0 /* mov reg to reg */
|
|
|
|
#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
|
|
|
|
|
|
|
|
/* change endianness of a register */
|
|
|
|
#define BPF_END 0xd0 /* flags for endianness conversion: */
|
|
|
|
#define BPF_TO_LE 0x00 /* convert to little-endian */
|
|
|
|
#define BPF_TO_BE 0x08 /* convert to big-endian */
|
|
|
|
#define BPF_FROM_LE BPF_TO_LE
|
|
|
|
#define BPF_FROM_BE BPF_TO_BE
|
|
|
|
|
|
|
|
#define BPF_JNE 0x50 /* jump != */
|
|
|
|
#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
|
|
|
|
#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
|
|
|
|
#define BPF_CALL 0x80 /* function call */
|
|
|
|
#define BPF_EXIT 0x90 /* function return */
|
|
|
|
|
2014-05-01 18:34:18 +02:00
|
|
|
/* Placeholder/dummy for 0 */
|
|
|
|
#define BPF_0 0
|
|
|
|
|
2014-05-01 18:34:19 +02:00
|
|
|
/* Register numbers */
|
|
|
|
enum {
|
|
|
|
BPF_REG_0 = 0,
|
|
|
|
BPF_REG_1,
|
|
|
|
BPF_REG_2,
|
|
|
|
BPF_REG_3,
|
|
|
|
BPF_REG_4,
|
|
|
|
BPF_REG_5,
|
|
|
|
BPF_REG_6,
|
|
|
|
BPF_REG_7,
|
|
|
|
BPF_REG_8,
|
|
|
|
BPF_REG_9,
|
|
|
|
BPF_REG_10,
|
|
|
|
__MAX_BPF_REG,
|
|
|
|
};
|
|
|
|
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
/* BPF has 10 general purpose 64-bit registers and stack frame. */
|
2014-05-01 18:34:19 +02:00
|
|
|
#define MAX_BPF_REG __MAX_BPF_REG
|
|
|
|
|
|
|
|
/* ArgX, context and stack frame pointer register positions. Note,
|
|
|
|
* Arg1, Arg2, Arg3, etc are used as argument mappings of function
|
|
|
|
* calls in BPF_CALL instruction.
|
|
|
|
*/
|
|
|
|
#define BPF_REG_ARG1 BPF_REG_1
|
|
|
|
#define BPF_REG_ARG2 BPF_REG_2
|
|
|
|
#define BPF_REG_ARG3 BPF_REG_3
|
|
|
|
#define BPF_REG_ARG4 BPF_REG_4
|
|
|
|
#define BPF_REG_ARG5 BPF_REG_5
|
|
|
|
#define BPF_REG_CTX BPF_REG_6
|
|
|
|
#define BPF_REG_FP BPF_REG_10
|
|
|
|
|
|
|
|
/* Additional register mappings for converted user programs. */
|
|
|
|
#define BPF_REG_A BPF_REG_0
|
|
|
|
#define BPF_REG_X BPF_REG_7
|
|
|
|
#define BPF_REG_TMP BPF_REG_8
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
|
|
|
|
/* BPF program can access up to 512 bytes of stack space. */
|
|
|
|
#define MAX_BPF_STACK 512
|
|
|
|
|
2014-05-01 18:34:19 +02:00
|
|
|
/* Macro to invoke filter function. */
|
|
|
|
#define SK_RUN_FILTER(filter, ctx) (*filter->bpf_func)(ctx, filter->insnsi)
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
|
|
|
|
struct sock_filter_int {
|
|
|
|
__u8 code; /* opcode */
|
|
|
|
__u8 a_reg:4; /* dest register */
|
|
|
|
__u8 x_reg:4; /* source register */
|
|
|
|
__s16 off; /* signed offset */
|
|
|
|
__s32 imm; /* signed immediate constant */
|
|
|
|
};
|
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
/* A struct sock_filter is architecture independent. */
|
2012-04-12 16:47:53 -05:00
|
|
|
struct compat_sock_fprog {
|
|
|
|
u16 len;
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
compat_uptr_t filter; /* struct sock_filter * */
|
2012-04-12 16:47:53 -05:00
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:19 +01:00
|
|
|
struct sock_fprog_kern {
|
|
|
|
u16 len;
|
|
|
|
struct sock_filter *filter;
|
|
|
|
};
|
|
|
|
|
2011-05-22 07:08:11 +00:00
|
|
|
struct sk_buff;
|
|
|
|
struct sock;
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
struct seccomp_data;
|
2011-05-22 07:08:11 +00:00
|
|
|
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:19 +01:00
|
|
|
struct sk_filter {
|
2008-04-10 01:33:47 -07:00
|
|
|
atomic_t refcnt;
|
2014-03-28 18:58:18 +01:00
|
|
|
u32 jited:1, /* Is our filter JIT'ed? */
|
|
|
|
len:31; /* Number of filter blocks */
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:19 +01:00
|
|
|
struct sock_fprog_kern *orig_prog; /* Original BPF program */
|
2013-10-04 00:14:06 -07:00
|
|
|
struct rcu_head rcu;
|
2011-04-20 09:27:32 +00:00
|
|
|
unsigned int (*bpf_func)(const struct sk_buff *skb,
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
const struct sock_filter_int *filter);
|
2013-10-04 00:14:06 -07:00
|
|
|
union {
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
struct sock_filter insns[0];
|
|
|
|
struct sock_filter_int insnsi[0];
|
2013-10-04 00:14:06 -07:00
|
|
|
struct work_struct work;
|
|
|
|
};
|
2008-04-10 01:33:47 -07:00
|
|
|
};
|
|
|
|
|
2013-10-04 00:14:06 -07:00
|
|
|
static inline unsigned int sk_filter_size(unsigned int proglen)
|
2008-04-10 01:33:47 -07:00
|
|
|
{
|
2013-10-04 00:14:06 -07:00
|
|
|
return max(sizeof(struct sk_filter),
|
|
|
|
offsetof(struct sk_filter, insns[proglen]));
|
2008-04-10 01:33:47 -07:00
|
|
|
}
|
|
|
|
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:19 +01:00
|
|
|
#define sk_filter_proglen(fprog) \
|
|
|
|
(fprog->len * sizeof(fprog->filter[0]))
|
|
|
|
|
2014-03-28 18:58:20 +01:00
|
|
|
int sk_filter(struct sock *sk, struct sk_buff *skb);
|
net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:25 +01:00
|
|
|
|
|
|
|
u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx,
|
|
|
|
const struct sock_filter_int *insni);
|
|
|
|
u32 sk_run_filter_int_skb(const struct sk_buff *ctx,
|
|
|
|
const struct sock_filter_int *insni);
|
|
|
|
|
|
|
|
int sk_convert_filter(struct sock_filter *prog, int len,
|
|
|
|
struct sock_filter_int *new_prog, int *new_len);
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:19 +01:00
|
|
|
|
2014-03-28 18:58:20 +01:00
|
|
|
int sk_unattached_filter_create(struct sk_filter **pfp,
|
|
|
|
struct sock_fprog *fprog);
|
|
|
|
void sk_unattached_filter_destroy(struct sk_filter *fp);
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:19 +01:00
|
|
|
|
2014-03-28 18:58:20 +01:00
|
|
|
int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
|
|
|
|
int sk_detach_filter(struct sock *sk);
|
net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc92778080 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 18:58:19 +01:00
|
|
|
|
2014-03-28 18:58:20 +01:00
|
|
|
int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
|
|
|
|
int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
|
|
|
|
unsigned int len);
|
|
|
|
void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
|
|
|
|
|
|
|
|
void sk_filter_charge(struct sock *sk, struct sk_filter *fp);
|
|
|
|
void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
|
2011-04-20 09:27:32 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_BPF_JIT
|
2013-05-01 16:24:08 -04:00
|
|
|
#include <stdarg.h>
|
2013-03-28 15:24:53 +00:00
|
|
|
#include <linux/linkage.h>
|
|
|
|
#include <linux/printk.h>
|
|
|
|
|
2014-03-28 18:58:20 +01:00
|
|
|
void bpf_jit_compile(struct sk_filter *fp);
|
|
|
|
void bpf_jit_free(struct sk_filter *fp);
|
2013-03-21 22:22:03 +01:00
|
|
|
|
|
|
|
static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
|
|
|
|
u32 pass, void *image)
|
|
|
|
{
|
2013-05-17 16:57:37 +00:00
|
|
|
pr_err("flen=%u proglen=%u pass=%u image=%pK\n",
|
2013-03-21 22:22:03 +01:00
|
|
|
flen, proglen, pass, image);
|
|
|
|
if (image)
|
2013-05-17 16:57:37 +00:00
|
|
|
print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
|
2013-03-21 22:22:03 +01:00
|
|
|
16, 1, image, proglen, false);
|
|
|
|
}
|
2011-04-20 09:27:32 +00:00
|
|
|
#else
|
2013-10-04 00:14:06 -07:00
|
|
|
#include <linux/slab.h>
|
2011-04-20 09:27:32 +00:00
|
|
|
static inline void bpf_jit_compile(struct sk_filter *fp)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void bpf_jit_free(struct sk_filter *fp)
|
|
|
|
{
|
2013-10-04 00:14:06 -07:00
|
|
|
kfree(fp);
|
2011-04-20 09:27:32 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2014-01-17 17:09:45 +01:00
|
|
|
static inline int bpf_tell_extensions(void)
|
|
|
|
{
|
net: filter: let bpf_tell_extensions return SKF_AD_MAX
Michal Sekletar added in commit ea02f9411d9f ("net: introduce
SO_BPF_EXTENSIONS") a facility where user space can enquire
the BPF ancillary instruction set, which is imho a step into
the right direction for letting user space high-level to BPF
optimizers make an informed decision for possibly using these
extensions.
The original rationale was to return through a getsockopt(2)
a bitfield of which instructions are supported and which
are not, as of right now, we just return 0 to indicate a
base support for SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET.
Limitations of this approach are that this API which we need
to maintain for a long time can only support a maximum of 32
extensions, and needs to be additionally maintained/updated
when each new extension that comes in.
I thought about this a bit more and what we can do here to
overcome this is to just return SKF_AD_MAX. Since we never
remove any extension since we cannot break user space and
always linearly increase SKF_AD_MAX on each newly added
extension, user space can make a decision on what extensions
are supported in the whole set of extensions and which aren't,
by just checking which of them from the whole set have an
offset < SKF_AD_MAX of the underlying kernel.
Since SKF_AD_MAX must be updated each time we add new ones,
we don't need to introduce an additional enum and got
maintenance for free. At some point in time when
SO_BPF_EXTENSIONS becomes ubiquitous for most kernels, then
an application can simply make use of this and easily be run
on newer or older underlying kernels without needing to be
recompiled, of course. Since that is for 3.14, it's not too
late to do this change.
Cc: Michal Sekletar <msekleta@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Michal Sekletar <msekleta@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 00:19:37 +01:00
|
|
|
return SKF_AD_MAX;
|
2014-01-17 17:09:45 +01:00
|
|
|
}
|
|
|
|
|
2011-04-20 09:27:32 +00:00
|
|
|
enum {
|
|
|
|
BPF_S_RET_K = 1,
|
|
|
|
BPF_S_RET_A,
|
|
|
|
BPF_S_ALU_ADD_K,
|
|
|
|
BPF_S_ALU_ADD_X,
|
|
|
|
BPF_S_ALU_SUB_K,
|
|
|
|
BPF_S_ALU_SUB_X,
|
|
|
|
BPF_S_ALU_MUL_K,
|
|
|
|
BPF_S_ALU_MUL_X,
|
|
|
|
BPF_S_ALU_DIV_X,
|
2012-09-07 22:03:35 +00:00
|
|
|
BPF_S_ALU_MOD_K,
|
|
|
|
BPF_S_ALU_MOD_X,
|
2011-04-20 09:27:32 +00:00
|
|
|
BPF_S_ALU_AND_K,
|
|
|
|
BPF_S_ALU_AND_X,
|
|
|
|
BPF_S_ALU_OR_K,
|
|
|
|
BPF_S_ALU_OR_X,
|
2012-09-24 02:23:59 +00:00
|
|
|
BPF_S_ALU_XOR_K,
|
|
|
|
BPF_S_ALU_XOR_X,
|
2011-04-20 09:27:32 +00:00
|
|
|
BPF_S_ALU_LSH_K,
|
|
|
|
BPF_S_ALU_LSH_X,
|
|
|
|
BPF_S_ALU_RSH_K,
|
|
|
|
BPF_S_ALU_RSH_X,
|
|
|
|
BPF_S_ALU_NEG,
|
|
|
|
BPF_S_LD_W_ABS,
|
|
|
|
BPF_S_LD_H_ABS,
|
|
|
|
BPF_S_LD_B_ABS,
|
|
|
|
BPF_S_LD_W_LEN,
|
|
|
|
BPF_S_LD_W_IND,
|
|
|
|
BPF_S_LD_H_IND,
|
|
|
|
BPF_S_LD_B_IND,
|
|
|
|
BPF_S_LD_IMM,
|
|
|
|
BPF_S_LDX_W_LEN,
|
|
|
|
BPF_S_LDX_B_MSH,
|
|
|
|
BPF_S_LDX_IMM,
|
|
|
|
BPF_S_MISC_TAX,
|
|
|
|
BPF_S_MISC_TXA,
|
|
|
|
BPF_S_ALU_DIV_K,
|
|
|
|
BPF_S_LD_MEM,
|
|
|
|
BPF_S_LDX_MEM,
|
|
|
|
BPF_S_ST,
|
|
|
|
BPF_S_STX,
|
|
|
|
BPF_S_JMP_JA,
|
|
|
|
BPF_S_JMP_JEQ_K,
|
|
|
|
BPF_S_JMP_JEQ_X,
|
|
|
|
BPF_S_JMP_JGE_K,
|
|
|
|
BPF_S_JMP_JGE_X,
|
|
|
|
BPF_S_JMP_JGT_K,
|
|
|
|
BPF_S_JMP_JGT_X,
|
|
|
|
BPF_S_JMP_JSET_K,
|
|
|
|
BPF_S_JMP_JSET_X,
|
|
|
|
/* Ancillary data */
|
|
|
|
BPF_S_ANC_PROTOCOL,
|
|
|
|
BPF_S_ANC_PKTTYPE,
|
|
|
|
BPF_S_ANC_IFINDEX,
|
|
|
|
BPF_S_ANC_NLATTR,
|
|
|
|
BPF_S_ANC_NLATTR_NEST,
|
|
|
|
BPF_S_ANC_MARK,
|
|
|
|
BPF_S_ANC_QUEUE,
|
|
|
|
BPF_S_ANC_HATYPE,
|
|
|
|
BPF_S_ANC_RXHASH,
|
|
|
|
BPF_S_ANC_CPU,
|
2012-03-31 11:01:20 +00:00
|
|
|
BPF_S_ANC_ALU_XOR_X,
|
2012-10-27 02:26:17 +00:00
|
|
|
BPF_S_ANC_VLAN_TAG,
|
|
|
|
BPF_S_ANC_VLAN_TAG_PRESENT,
|
filter: add ANC_PAY_OFFSET instruction for loading payload start offset
It is very useful to do dynamic truncation of packets. In particular,
we're interested to push the necessary header bytes to the user space and
cut off user payload that should probably not be transferred for some reasons
(e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET,
we can load it into the accumulator, and return it. E.g. in bpfc syntax ...
ld #poff ; { 0x20, 0, 0, 0xfffff034 },
ret a ; { 0x16, 0, 0, 0x00000000 },
... as a filter will accomplish this without having to do a big hackery in
a BPF filter itself. Follow-up JIT implementations are welcome.
Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-19 06:39:31 +00:00
|
|
|
BPF_S_ANC_PAY_OFFSET,
|
2014-04-21 09:21:24 -07:00
|
|
|
BPF_S_ANC_RANDOM,
|
2011-04-20 09:27:32 +00:00
|
|
|
};
|
|
|
|
|
2005-04-16 15:20:36 -07:00
|
|
|
#endif /* __LINUX_FILTER_H__ */
|