bpf-next-for-netdev

-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlWtmQAKCRDbK58LschI
 g0TUAQDT76jx7Rq1DShCtZ3eqiBMNkYczK8b+GqNsSG8YGduaAEA1jn/GN+H65Rh
 atQZ/pYAfLZflMV04+XE0GyBr5q1uQg=
 =NczG
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-05-28

We've added 23 non-merge commits during the last 11 day(s) which contain
a total of 45 files changed, 696 insertions(+), 277 deletions(-).

The main changes are:

1) Rename skb's mono_delivery_time to tstamp_type for extensibility
   and add SKB_CLOCK_TAI type support to bpf_skb_set_tstamp(),
   from Abhishek Chauhan.

2) Add netfilter CT zone ID and direction to bpf_ct_opts so that arbitrary
   CT zones can be used from XDP/tc BPF netfilter CT helper functions,
   from Brad Cowie.

3) Several tweaks to the instruction-set.rst IETF doc to address
   the Last Call review comments, from Dave Thaler.

4) Small batch of riscv64 BPF JIT optimizations in order to emit more
   compressed instructions to the JITed image for better icache efficiency,
   from Xiao Wang.

5) Sort bpftool C dump output from BTF, aiming to simplify vmlinux.h
   diffing and forcing more natural type definitions ordering,
   from Mykyta Yatsenko.

6) Use DEV_STATS_INC() macro in BPF redirect helpers to silence
   a syzbot/KCSAN race report for the tx_errors counter,
   from Jiang Yunshui.

7) Un-constify bpf_func_info in bpftool to fix compilation with LLVM 17+
   which started treating const structs as constants and thus breaking
   full BTF program name resolution, from Ivan Babrou.

8) Fix up BPF program numbers in test_sockmap selftest in order to reduce
   some of the test-internal array sizes, from Geliang Tang.

9) Small cleanup in Makefile.btf script to use test-ge check for v1.25-only
   pahole, from Alan Maguire.

10) Fix bpftool's make dependencies for vmlinux.h in order to avoid needless
    rebuilds in some corner cases, from Artem Savkov.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (23 commits)
  bpf, net: Use DEV_STAT_INC()
  bpf, docs: Fix instruction.rst indentation
  bpf, docs: Clarify call local offset
  bpf, docs: Add table captions
  bpf, docs: clarify sign extension of 64-bit use of 32-bit imm
  bpf, docs: Use RFC 2119 language for ISA requirements
  bpf, docs: Move sentence about returning R0 to abi.rst
  bpf: constify member bpf_sysctl_kern:: Table
  riscv, bpf: Try RVC for reg move within BPF_CMPXCHG JIT
  riscv, bpf: Use STACK_ALIGN macro for size rounding up
  riscv, bpf: Optimize zextw insn with Zba extension
  selftests/bpf: Handle forwarding of UDP CLOCK_TAI packets
  net: Add additional bit to support clockid_t timestamp type
  net: Rename mono_delivery_time to tstamp_type for scalabilty
  selftests/bpf: Update tests for new ct zone opts for nf_conntrack kfuncs
  net: netfilter: Make ct zone opts configurable for bpf ct helpers
  selftests/bpf: Fix prog numbers in test_sockmap
  bpf: Remove unused variable "prev_state"
  bpftool: Un-const bpf_func_info to fix it for llvm 17 and newer
  bpf: Fix order of args in call to bpf_map_kvcalloc
  ...
====================

Link: https://lore.kernel.org/r/20240528105924.30905-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit is contained in:
Jakub Kicinski 2024-05-28 07:27:28 -07:00
commit 4b3529edbb
45 changed files with 698 additions and 279 deletions

View File

@ -23,3 +23,6 @@ The BPF calling convention is defined as:
R0 - R5 are scratch registers and BPF programs needs to spill/fill them if
necessary across calls.
The BPF program needs to store the return value into register R0 before doing an
``EXIT``.

View File

@ -14,6 +14,13 @@ set architecture (ISA).
Documentation conventions
=========================
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 `<https://www.rfc-editor.org/info/rfc2119>`_
`RFC8174 <https://www.rfc-editor.org/info/rfc8174>`_
when, and only when, they appear in all capitals, as shown here.
For brevity and consistency, this document refers to families
of types using a shorthand syntax and refers to several expository,
mnemonic functions when describing the semantics of instructions.
@ -25,7 +32,7 @@ Types
This document refers to integer types with the notation `SN` to specify
a type's signedness (`S`) and bit width (`N`), respectively.
.. table:: Meaning of signedness notation.
.. table:: Meaning of signedness notation
==== =========
S Meaning
@ -34,7 +41,7 @@ a type's signedness (`S`) and bit width (`N`), respectively.
s signed
==== =========
.. table:: Meaning of bit-width notation.
.. table:: Meaning of bit-width notation
===== =========
N Bit width
@ -106,9 +113,9 @@ Conformance groups
An implementation does not need to support all instructions specified in this
document (e.g., deprecated instructions). Instead, a number of conformance
groups are specified. An implementation must support the base32 conformance
group and may support additional conformance groups, where supporting a
conformance group means it must support all instructions in that conformance
groups are specified. An implementation MUST support the base32 conformance
group and MAY support additional conformance groups, where supporting a
conformance group means it MUST support all instructions in that conformance
group.
The use of named conformance groups enables interoperability between a runtime
@ -209,7 +216,7 @@ For example::
07 1 0 00 00 11 22 33 44 r1 += 0x11223344 // big
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
Unused fields SHALL be cleared to zero.
Wide instruction encoding
--------------------------
@ -256,18 +263,20 @@ Instruction classes
The three least significant bits of the 'opcode' field store the instruction class:
===== ===== =============================== ===================================
class value description reference
===== ===== =============================== ===================================
LD 0x0 non-standard load operations `Load and store instructions`_
LDX 0x1 load into register operations `Load and store instructions`_
ST 0x2 store from immediate operations `Load and store instructions`_
STX 0x3 store from register operations `Load and store instructions`_
ALU 0x4 32-bit arithmetic operations `Arithmetic and jump instructions`_
JMP 0x5 64-bit jump operations `Arithmetic and jump instructions`_
JMP32 0x6 32-bit jump operations `Arithmetic and jump instructions`_
ALU64 0x7 64-bit arithmetic operations `Arithmetic and jump instructions`_
===== ===== =============================== ===================================
.. table:: Instruction class
===== ===== =============================== ===================================
class value description reference
===== ===== =============================== ===================================
LD 0x0 non-standard load operations `Load and store instructions`_
LDX 0x1 load into register operations `Load and store instructions`_
ST 0x2 store from immediate operations `Load and store instructions`_
STX 0x3 store from register operations `Load and store instructions`_
ALU 0x4 32-bit arithmetic operations `Arithmetic and jump instructions`_
JMP 0x5 64-bit jump operations `Arithmetic and jump instructions`_
JMP32 0x6 32-bit jump operations `Arithmetic and jump instructions`_
ALU64 0x7 64-bit arithmetic operations `Arithmetic and jump instructions`_
===== ===== =============================== ===================================
Arithmetic and jump instructions
================================
@ -285,12 +294,14 @@ For arithmetic and jump instructions (``ALU``, ``ALU64``, ``JMP`` and
**s (source)**
the source operand location, which unless otherwise specified is one of:
====== ===== ==============================================
source value description
====== ===== ==============================================
K 0 use 32-bit 'imm' value as source operand
X 1 use 'src_reg' register value as source operand
====== ===== ==============================================
.. table:: Source operand location
====== ===== ==============================================
source value description
====== ===== ==============================================
K 0 use 32-bit 'imm' value as source operand
X 1 use 'src_reg' register value as source operand
====== ===== ==============================================
**instruction class**
the instruction class (see `Instruction classes`_)
@ -305,27 +316,29 @@ The 'code' field encodes the operation as below, where 'src' refers to the
the source operand and 'dst' refers to the value of the destination
register.
===== ===== ======= ==========================================================
name code offset description
===== ===== ======= ==========================================================
ADD 0x0 0 dst += src
SUB 0x1 0 dst -= src
MUL 0x2 0 dst \*= src
DIV 0x3 0 dst = (src != 0) ? (dst / src) : 0
SDIV 0x3 1 dst = (src != 0) ? (dst s/ src) : 0
OR 0x4 0 dst \|= src
AND 0x5 0 dst &= src
LSH 0x6 0 dst <<= (src & mask)
RSH 0x7 0 dst >>= (src & mask)
NEG 0x8 0 dst = -dst
MOD 0x9 0 dst = (src != 0) ? (dst % src) : dst
SMOD 0x9 1 dst = (src != 0) ? (dst s% src) : dst
XOR 0xa 0 dst ^= src
MOV 0xb 0 dst = src
MOVSX 0xb 8/16/32 dst = (s8,s16,s32)src
ARSH 0xc 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
END 0xd 0 byte swap operations (see `Byte swap instructions`_ below)
===== ===== ======= ==========================================================
.. table:: Arithmetic instructions
===== ===== ======= ==========================================================
name code offset description
===== ===== ======= ==========================================================
ADD 0x0 0 dst += src
SUB 0x1 0 dst -= src
MUL 0x2 0 dst \*= src
DIV 0x3 0 dst = (src != 0) ? (dst / src) : 0
SDIV 0x3 1 dst = (src != 0) ? (dst s/ src) : 0
OR 0x4 0 dst \|= src
AND 0x5 0 dst &= src
LSH 0x6 0 dst <<= (src & mask)
RSH 0x7 0 dst >>= (src & mask)
NEG 0x8 0 dst = -dst
MOD 0x9 0 dst = (src != 0) ? (dst % src) : dst
SMOD 0x9 1 dst = (src != 0) ? (dst s% src) : dst
XOR 0xa 0 dst ^= src
MOV 0xb 0 dst = src
MOVSX 0xb 8/16/32 dst = (s8,s16,s32)src
ARSH 0xc 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
END 0xd 0 byte swap operations (see `Byte swap instructions`_ below)
===== ===== ======= ==========================================================
Underflow and overflow are allowed during arithmetic operations, meaning
the 64-bit or 32-bit value will wrap. If BPF program execution would
@ -374,7 +387,7 @@ interpreted as a 64-bit signed value.
Note that there are varying definitions of the signed modulo operation
when the dividend or divisor are negative, where implementations often
vary by language such that Python, Ruby, etc. differ from C, Go, Java,
etc. This specification requires that signed modulo use truncated division
etc. This specification requires that signed modulo MUST use truncated division
(where -13 % 3 == -1) as implemented in C, Go, etc.::
a % n = a - n * trunc(a / n)
@ -386,6 +399,19 @@ The ``MOVSX`` instruction does a move operation with sign extension.
operands into 64-bit operands. Unlike other arithmetic instructions,
``MOVSX`` is only defined for register source operands (``X``).
``{MOV, K, ALU64}`` means::
dst = (s64)imm
``{MOV, X, ALU}`` means::
dst = (u32)src
``{MOVSX, X, ALU}`` with 'offset' 8 means::
dst = (u32)(s32)(s8)src
The ``NEG`` instruction is only defined when the source bit is clear
(``K``).
@ -404,15 +430,17 @@ only and do not use a separate source register or immediate value.
For ``ALU``, the 1-bit source operand field in the opcode is used to
select what byte order the operation converts from or to. For
``ALU64``, the 1-bit source operand field in the opcode is reserved
and must be set to 0.
and MUST be set to 0.
===== ======== ===== =================================================
class source value description
===== ======== ===== =================================================
ALU TO_LE 0 convert between host byte order and little endian
ALU TO_BE 1 convert between host byte order and big endian
ALU64 Reserved 0 do byte swap unconditionally
===== ======== ===== =================================================
.. table:: Byte swap instructions
===== ======== ===== =================================================
class source value description
===== ======== ===== =================================================
ALU TO_LE 0 convert between host byte order and little endian
ALU TO_BE 1 convert between host byte order and big endian
ALU64 Reserved 0 do byte swap unconditionally
===== ======== ===== =================================================
The 'imm' field encodes the width of the swap operations. The following widths
are supported: 16, 32 and 64. Width 64 operations belong to the base64
@ -448,27 +476,29 @@ otherwise identical operations, and indicates the base64 conformance
group unless otherwise specified.
The 'code' field encodes the operation as below:
======== ===== ======= ================================= ===================================================
code value src_reg description notes
======== ===== ======= ================================= ===================================================
JA 0x0 0x0 PC += offset {JA, K, JMP} only
JA 0x0 0x0 PC += imm {JA, K, JMP32} only
JEQ 0x1 any PC += offset if dst == src
JGT 0x2 any PC += offset if dst > src unsigned
JGE 0x3 any PC += offset if dst >= src unsigned
JSET 0x4 any PC += offset if dst & src
JNE 0x5 any PC += offset if dst != src
JSGT 0x6 any PC += offset if dst > src signed
JSGE 0x7 any PC += offset if dst >= src signed
CALL 0x8 0x0 call helper function by static ID {CALL, K, JMP} only, see `Helper functions`_
CALL 0x8 0x1 call PC += imm {CALL, K, JMP} only, see `Program-local functions`_
CALL 0x8 0x2 call helper function by BTF ID {CALL, K, JMP} only, see `Helper functions`_
EXIT 0x9 0x0 return {CALL, K, JMP} only
JLT 0xa any PC += offset if dst < src unsigned
JLE 0xb any PC += offset if dst <= src unsigned
JSLT 0xc any PC += offset if dst < src signed
JSLE 0xd any PC += offset if dst <= src signed
======== ===== ======= ================================= ===================================================
.. table:: Jump instructions
======== ===== ======= ================================= ===================================================
code value src_reg description notes
======== ===== ======= ================================= ===================================================
JA 0x0 0x0 PC += offset {JA, K, JMP} only
JA 0x0 0x0 PC += imm {JA, K, JMP32} only
JEQ 0x1 any PC += offset if dst == src
JGT 0x2 any PC += offset if dst > src unsigned
JGE 0x3 any PC += offset if dst >= src unsigned
JSET 0x4 any PC += offset if dst & src
JNE 0x5 any PC += offset if dst != src
JSGT 0x6 any PC += offset if dst > src signed
JSGE 0x7 any PC += offset if dst >= src signed
CALL 0x8 0x0 call helper function by static ID {CALL, K, JMP} only, see `Helper functions`_
CALL 0x8 0x1 call PC += imm {CALL, K, JMP} only, see `Program-local functions`_
CALL 0x8 0x2 call helper function by BTF ID {CALL, K, JMP} only, see `Helper functions`_
EXIT 0x9 0x0 return {CALL, K, JMP} only
JLT 0xa any PC += offset if dst < src unsigned
JLE 0xb any PC += offset if dst <= src unsigned
JSLT 0xc any PC += offset if dst < src signed
JSLE 0xd any PC += offset if dst <= src signed
======== ===== ======= ================================= ===================================================
where 'PC' denotes the program counter, and the offset to increment by
is in units of 64-bit instructions relative to the instruction following
@ -476,9 +506,6 @@ the jump instruction. Thus 'PC += 1' skips execution of the next
instruction if it's a basic instruction or results in undefined behavior
if the next instruction is a 128-bit wide instruction.
The BPF program needs to store the return value into register R0 before doing an
``EXIT``.
Example:
``{JSGE, X, JMP32}`` means::
@ -487,6 +514,10 @@ Example:
where 's>=' indicates a signed '>=' comparison.
``{JLE, K, JMP}`` means::
if dst <= (u64)(s64)imm goto +offset
``{JA, K, JMP32}`` means::
gotol +imm
@ -515,14 +546,16 @@ for each program type, but static IDs are unique across all program types.
Platforms that support the BPF Type Format (BTF) support identifying
a helper function by a BTF ID encoded in the 'imm' field, where the BTF ID
identifies the helper name and type.
identifies the helper name and type. Further documentation of BTF
is outside the scope of this document and is left for future work.
Program-local functions
~~~~~~~~~~~~~~~~~~~~~~~
Program-local functions are functions exposed by the same BPF program as the
caller, and are referenced by offset from the call instruction, similar to
``JA``. The offset is encoded in the 'imm' field of the call instruction.
An ``EXIT`` within the program-local function will return to the caller.
caller, and are referenced by offset from the instruction following the call
instruction, similar to ``JA``. The offset is encoded in the 'imm' field of
the call instruction. An ``EXIT`` within the program-local function will
return to the caller.
Load and store instructions
===========================
@ -537,6 +570,8 @@ For load and store instructions (``LD``, ``LDX``, ``ST``, and ``STX``), the
**mode**
The mode modifier is one of:
.. table:: Mode modifier
============= ===== ==================================== =============
mode modifier value description reference
============= ===== ==================================== =============
@ -551,6 +586,8 @@ For load and store instructions (``LD``, ``LDX``, ``ST``, and ``STX``), the
**sz (size)**
The size modifier is one of:
.. table:: Size modifier
==== ===== =====================
size value description
==== ===== =====================
@ -619,14 +656,16 @@ The 'imm' field is used to encode the actual atomic operation.
Simple atomic operation use a subset of the values defined to encode
arithmetic operations in the 'imm' field to encode the atomic operation:
======== ===== ===========
imm value description
======== ===== ===========
ADD 0x00 atomic add
OR 0x40 atomic or
AND 0x50 atomic and
XOR 0xa0 atomic xor
======== ===== ===========
.. table:: Simple atomic operations
======== ===== ===========
imm value description
======== ===== ===========
ADD 0x00 atomic add
OR 0x40 atomic or
AND 0x50 atomic and
XOR 0xa0 atomic xor
======== ===== ===========
``{ATOMIC, W, STX}`` with 'imm' = ADD means::
@ -640,13 +679,15 @@ XOR 0xa0 atomic xor
In addition to the simple atomic operations, there also is a modifier and
two complex atomic operations:
=========== ================ ===========================
imm value description
=========== ================ ===========================
FETCH 0x01 modifier: return old value
XCHG 0xe0 | FETCH atomic exchange
CMPXCHG 0xf0 | FETCH atomic compare and exchange
=========== ================ ===========================
.. table:: Complex atomic operations
=========== ================ ===========================
imm value description
=========== ================ ===========================
FETCH 0x01 modifier: return old value
XCHG 0xe0 | FETCH atomic exchange
CMPXCHG 0xf0 | FETCH atomic compare and exchange
=========== ================ ===========================
The ``FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``FETCH`` flag
@ -673,17 +714,19 @@ The following table defines a set of ``{IMM, DW, LD}`` instructions
with opcode subtypes in the 'src_reg' field, using new terms such as "map"
defined further below:
======= ========================================= =========== ==============
src_reg pseudocode imm type dst type
======= ========================================= =========== ==============
0x0 dst = (next_imm << 32) | imm integer integer
0x1 dst = map_by_fd(imm) map fd map
0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data address
0x3 dst = var_addr(imm) variable id data address
0x4 dst = code_addr(imm) integer code address
0x5 dst = map_by_idx(imm) map index map
0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data address
======= ========================================= =========== ==============
.. table:: 64-bit immediate instructions
======= ========================================= =========== ==============
src_reg pseudocode imm type dst type
======= ========================================= =========== ==============
0x0 dst = (next_imm << 32) | imm integer integer
0x1 dst = map_by_fd(imm) map fd map
0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data address
0x3 dst = var_addr(imm) variable id data address
0x4 dst = code_addr(imm) integer code address
0x5 dst = map_by_idx(imm) map index map
0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data address
======= ========================================= =========== ==============
where
@ -725,5 +768,5 @@ carried over from classic BPF. These instructions used an instruction
class of ``LD``, a size modifier of ``W``, ``H``, or ``B``, and a
mode modifier of ``ABS`` or ``IND``. The 'dst_reg' and 'offset' fields were
set to zero, and 'src_reg' was set to zero for ``ABS``. However, these
instructions are deprecated and should no longer be used. All legacy packet
instructions are deprecated and SHOULD no longer be used. All legacy packet
access instructions belong to the "packet" conformance group.

View File

@ -604,6 +604,18 @@ config TOOLCHAIN_HAS_VECTOR_CRYPTO
def_bool $(as-instr, .option arch$(comma) +v$(comma) +zvkb)
depends on AS_HAS_OPTION_ARCH
config RISCV_ISA_ZBA
bool "Zba extension support for bit manipulation instructions"
default y
help
Add support for enabling optimisations in the kernel when the Zba
extension is detected at boot.
The Zba extension provides instructions to accelerate the generation
of addresses that index into arrays of basic data types.
If you don't know what to do here, say Y.
config RISCV_ISA_ZBB
bool "Zbb extension support for bit manipulation instructions"
depends on TOOLCHAIN_HAS_ZBB

View File

@ -18,6 +18,11 @@ static inline bool rvc_enabled(void)
return IS_ENABLED(CONFIG_RISCV_ISA_C);
}
static inline bool rvzba_enabled(void)
{
return IS_ENABLED(CONFIG_RISCV_ISA_ZBA) && riscv_has_extension_likely(RISCV_ISA_EXT_ZBA);
}
static inline bool rvzbb_enabled(void)
{
return IS_ENABLED(CONFIG_RISCV_ISA_ZBB) && riscv_has_extension_likely(RISCV_ISA_EXT_ZBB);
@ -939,6 +944,14 @@ static inline u16 rvc_sdsp(u32 imm9, u8 rs2)
return rv_css_insn(0x7, imm, rs2, 0x2);
}
/* RV64-only ZBA instructions. */
static inline u32 rvzba_zextw(u8 rd, u8 rs1)
{
/* add.uw rd, rs1, ZERO */
return rv_r_insn(0x04, RV_REG_ZERO, rs1, 0, rd, 0x3b);
}
#endif /* __riscv_xlen == 64 */
/* Helper functions that emit RVC instructions when possible. */
@ -1161,6 +1174,11 @@ static inline void emit_zexth(u8 rd, u8 rs, struct rv_jit_context *ctx)
static inline void emit_zextw(u8 rd, u8 rs, struct rv_jit_context *ctx)
{
if (rvzba_enabled()) {
emit(rvzba_zextw(rd, rs), ctx);
return;
}
emit_slli(rd, rs, 32, ctx);
emit_srli(rd, rd, 32, ctx);
}

View File

@ -537,8 +537,10 @@ static void emit_atomic(u8 rd, u8 rs, s16 off, s32 imm, bool is64,
/* r0 = atomic_cmpxchg(dst_reg + off16, r0, src_reg); */
case BPF_CMPXCHG:
r0 = bpf_to_rv_reg(BPF_REG_0, ctx);
emit(is64 ? rv_addi(RV_REG_T2, r0, 0) :
rv_addiw(RV_REG_T2, r0, 0), ctx);
if (is64)
emit_mv(RV_REG_T2, r0, ctx);
else
emit_addiw(RV_REG_T2, r0, 0, ctx);
emit(is64 ? rv_lr_d(r0, 0, rd, 0, 0) :
rv_lr_w(r0, 0, rd, 0, 0), ctx);
jmp_offset = ninsns_rvoff(8);
@ -868,7 +870,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im,
stack_size += 8;
sreg_off = stack_size;
stack_size = round_up(stack_size, 16);
stack_size = round_up(stack_size, STACK_ALIGN);
if (!is_struct_ops) {
/* For the trampoline called from function entry,
@ -1960,7 +1962,7 @@ void bpf_jit_build_prologue(struct rv_jit_context *ctx, bool is_subprog)
{
int i, stack_adjust = 0, store_offset, bpf_stack_adjust;
bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, STACK_ALIGN);
if (bpf_stack_adjust)
mark_fp(ctx);
@ -1982,7 +1984,7 @@ void bpf_jit_build_prologue(struct rv_jit_context *ctx, bool is_subprog)
if (ctx->arena_vm_start)
stack_adjust += 8;
stack_adjust = round_up(stack_adjust, 16);
stack_adjust = round_up(stack_adjust, STACK_ALIGN);
stack_adjust += bpf_stack_adjust;
store_offset = stack_adjust - 8;

View File

@ -1406,7 +1406,7 @@ struct bpf_sock_ops_kern {
struct bpf_sysctl_kern {
struct ctl_table_header *head;
struct ctl_table *table;
const struct ctl_table *table;
void *cur_val;
size_t cur_len;
void *new_val;

View File

@ -706,6 +706,13 @@ typedef unsigned int sk_buff_data_t;
typedef unsigned char *sk_buff_data_t;
#endif
enum skb_tstamp_type {
SKB_CLOCK_REALTIME,
SKB_CLOCK_MONOTONIC,
SKB_CLOCK_TAI,
__SKB_CLOCK_MAX = SKB_CLOCK_TAI,
};
/**
* DOC: Basic sk_buff geometry
*
@ -823,10 +830,8 @@ typedef unsigned char *sk_buff_data_t;
* @dst_pending_confirm: need to confirm neighbour
* @decrypted: Decrypted SKB
* @slow_gro: state present at GRO time, slower prepare step required
* @mono_delivery_time: When set, skb->tstamp has the
* delivery_time in mono clock base (i.e. EDT). Otherwise, the
* skb->tstamp has the (rcv) timestamp at ingress and
* delivery_time at egress.
* @tstamp_type: When set, skb->tstamp has the
* delivery_time clock base of skb->tstamp.
* @napi_id: id of the NAPI struct this skb came from
* @sender_cpu: (aka @napi_id) source CPU in XPS
* @alloc_cpu: CPU which did the skb allocation.
@ -954,7 +959,7 @@ struct sk_buff {
/* private: */
__u8 __mono_tc_offset[0];
/* public: */
__u8 mono_delivery_time:1; /* See SKB_MONO_DELIVERY_TIME_MASK */
__u8 tstamp_type:2; /* See skb_tstamp_type */
#ifdef CONFIG_NET_XGRESS
__u8 tc_at_ingress:1; /* See TC_AT_INGRESS_MASK */
__u8 tc_skip_classify:1;
@ -1084,15 +1089,16 @@ struct sk_buff {
#endif
#define PKT_TYPE_OFFSET offsetof(struct sk_buff, __pkt_type_offset)
/* if you move tc_at_ingress or mono_delivery_time
/* if you move tc_at_ingress or tstamp_type
* around, you also must adapt these constants.
*/
#ifdef __BIG_ENDIAN_BITFIELD
#define SKB_MONO_DELIVERY_TIME_MASK (1 << 7)
#define TC_AT_INGRESS_MASK (1 << 6)
#define SKB_TSTAMP_TYPE_MASK (3 << 6)
#define SKB_TSTAMP_TYPE_RSHIFT (6)
#define TC_AT_INGRESS_MASK (1 << 5)
#else
#define SKB_MONO_DELIVERY_TIME_MASK (1 << 0)
#define TC_AT_INGRESS_MASK (1 << 1)
#define SKB_TSTAMP_TYPE_MASK (3)
#define TC_AT_INGRESS_MASK (1 << 2)
#endif
#define SKB_BF_MONO_TC_OFFSET offsetof(struct sk_buff, __mono_tc_offset)
@ -4179,7 +4185,7 @@ static inline void skb_get_new_timestampns(const struct sk_buff *skb,
static inline void __net_timestamp(struct sk_buff *skb)
{
skb->tstamp = ktime_get_real();
skb->mono_delivery_time = 0;
skb->tstamp_type = SKB_CLOCK_REALTIME;
}
static inline ktime_t net_timedelta(ktime_t t)
@ -4188,10 +4194,36 @@ static inline ktime_t net_timedelta(ktime_t t)
}
static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
bool mono)
u8 tstamp_type)
{
skb->tstamp = kt;
skb->mono_delivery_time = kt && mono;
if (kt)
skb->tstamp_type = tstamp_type;
else
skb->tstamp_type = SKB_CLOCK_REALTIME;
}
static inline void skb_set_delivery_type_by_clockid(struct sk_buff *skb,
ktime_t kt, clockid_t clockid)
{
u8 tstamp_type = SKB_CLOCK_REALTIME;
switch (clockid) {
case CLOCK_REALTIME:
break;
case CLOCK_MONOTONIC:
tstamp_type = SKB_CLOCK_MONOTONIC;
break;
case CLOCK_TAI:
tstamp_type = SKB_CLOCK_TAI;
break;
default:
WARN_ON_ONCE(1);
kt = 0;
}
skb_set_delivery_time(skb, kt, tstamp_type);
}
DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
@ -4201,8 +4233,8 @@ DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
*/
static inline void skb_clear_delivery_time(struct sk_buff *skb)
{
if (skb->mono_delivery_time) {
skb->mono_delivery_time = 0;
if (skb->tstamp_type) {
skb->tstamp_type = SKB_CLOCK_REALTIME;
if (static_branch_unlikely(&netstamp_needed_key))
skb->tstamp = ktime_get_real();
else
@ -4212,7 +4244,7 @@ static inline void skb_clear_delivery_time(struct sk_buff *skb)
static inline void skb_clear_tstamp(struct sk_buff *skb)
{
if (skb->mono_delivery_time)
if (skb->tstamp_type)
return;
skb->tstamp = 0;
@ -4220,7 +4252,7 @@ static inline void skb_clear_tstamp(struct sk_buff *skb)
static inline ktime_t skb_tstamp(const struct sk_buff *skb)
{
if (skb->mono_delivery_time)
if (skb->tstamp_type)
return 0;
return skb->tstamp;
@ -4228,7 +4260,7 @@ static inline ktime_t skb_tstamp(const struct sk_buff *skb)
static inline ktime_t skb_tstamp_cond(const struct sk_buff *skb, bool cond)
{
if (!skb->mono_delivery_time && skb->tstamp)
if (skb->tstamp_type != SKB_CLOCK_MONOTONIC && skb->tstamp)
return skb->tstamp;
if (static_branch_unlikely(&netstamp_needed_key) || cond)

View File

@ -76,7 +76,7 @@ struct frag_v6_compare_key {
* @stamp: timestamp of the last received fragment
* @len: total length of the original datagram
* @meat: length of received fragments so far
* @mono_delivery_time: stamp has a mono delivery time (EDT)
* @tstamp_type: stamp has a mono delivery time (EDT)
* @flags: fragment queue flags
* @max_size: maximum received fragment size
* @fqdir: pointer to struct fqdir
@ -97,7 +97,7 @@ struct inet_frag_queue {
ktime_t stamp;
int len;
int meat;
u8 mono_delivery_time;
u8 tstamp_type;
__u8 flags;
u16 max_size;
struct fqdir *fqdir;

View File

@ -6207,12 +6207,17 @@ union { \
__u64 :64; \
} __attribute__((aligned(8)))
/* The enum used in skb->tstamp_type. It specifies the clock type
* of the time stored in the skb->tstamp.
*/
enum {
BPF_SKB_TSTAMP_UNSPEC,
BPF_SKB_TSTAMP_DELIVERY_MONO, /* tstamp has mono delivery time */
/* For any BPF_SKB_TSTAMP_* that the bpf prog cannot handle,
* the bpf prog should handle it like BPF_SKB_TSTAMP_UNSPEC
* and try to deduce it by ingress, egress or skb->sk->sk_clockid.
BPF_SKB_TSTAMP_UNSPEC = 0, /* DEPRECATED */
BPF_SKB_TSTAMP_DELIVERY_MONO = 1, /* DEPRECATED */
BPF_SKB_CLOCK_REALTIME = 0,
BPF_SKB_CLOCK_MONOTONIC = 1,
BPF_SKB_CLOCK_TAI = 2,
/* For any future BPF_SKB_CLOCK_* that the bpf prog cannot handle,
* the bpf prog can try to deduce it by ingress/egress/skb->sk->sk_clockid.
*/
};

View File

@ -782,8 +782,8 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
nbuckets = max_t(u32, 2, nbuckets);
smap->bucket_log = ilog2(nbuckets);
smap->buckets = bpf_map_kvcalloc(&smap->map, sizeof(*smap->buckets),
nbuckets, GFP_USER | __GFP_NOWARN);
smap->buckets = bpf_map_kvcalloc(&smap->map, nbuckets,
sizeof(*smap->buckets), GFP_USER | __GFP_NOWARN);
if (!smap->buckets) {
err = -ENOMEM;
goto free_smap;

View File

@ -32,7 +32,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
struct sk_buff *))
{
int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
bool mono_delivery_time = skb->mono_delivery_time;
u8 tstamp_type = skb->tstamp_type;
unsigned int hlen, ll_rs, mtu;
ktime_t tstamp = skb->tstamp;
struct ip_frag_state state;
@ -82,7 +82,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
if (iter.frag)
ip_fraglist_prepare(skb, &iter);
skb_set_delivery_time(skb, tstamp, mono_delivery_time);
skb_set_delivery_time(skb, tstamp, tstamp_type);
err = output(net, sk, data, skb);
if (err || !iter.frag)
break;
@ -113,7 +113,7 @@ static int nf_br_ip_fragment(struct net *net, struct sock *sk,
goto blackhole;
}
skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
skb_set_delivery_time(skb2, tstamp, tstamp_type);
err = output(net, sk, data, skb2);
if (err)
goto blackhole;

View File

@ -2160,7 +2160,7 @@ EXPORT_SYMBOL(net_disable_timestamp);
static inline void net_timestamp_set(struct sk_buff *skb)
{
skb->tstamp = 0;
skb->mono_delivery_time = 0;
skb->tstamp_type = SKB_CLOCK_REALTIME;
if (static_branch_unlikely(&netstamp_needed_key))
skb->tstamp = ktime_get_real();
}

View File

@ -2274,12 +2274,12 @@ static int __bpf_redirect_neigh_v6(struct sk_buff *skb, struct net_device *dev,
err = bpf_out_neigh_v6(net, skb, dev, nh);
if (unlikely(net_xmit_eval(err)))
dev->stats.tx_errors++;
DEV_STATS_INC(dev, tx_errors);
else
ret = NET_XMIT_SUCCESS;
goto out_xmit;
out_drop:
dev->stats.tx_errors++;
DEV_STATS_INC(dev, tx_errors);
kfree_skb(skb);
out_xmit:
return ret;
@ -2380,12 +2380,12 @@ static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev,
err = bpf_out_neigh_v4(net, skb, dev, nh);
if (unlikely(net_xmit_eval(err)))
dev->stats.tx_errors++;
DEV_STATS_INC(dev, tx_errors);
else
ret = NET_XMIT_SUCCESS;
goto out_xmit;
out_drop:
dev->stats.tx_errors++;
DEV_STATS_INC(dev, tx_errors);
kfree_skb(skb);
out_xmit:
return ret;
@ -7726,17 +7726,21 @@ BPF_CALL_3(bpf_skb_set_tstamp, struct sk_buff *, skb,
return -EOPNOTSUPP;
switch (tstamp_type) {
case BPF_SKB_TSTAMP_DELIVERY_MONO:
case BPF_SKB_CLOCK_REALTIME:
skb->tstamp = tstamp;
skb->tstamp_type = SKB_CLOCK_REALTIME;
break;
case BPF_SKB_CLOCK_MONOTONIC:
if (!tstamp)
return -EINVAL;
skb->tstamp = tstamp;
skb->mono_delivery_time = 1;
skb->tstamp_type = SKB_CLOCK_MONOTONIC;
break;
case BPF_SKB_TSTAMP_UNSPEC:
if (tstamp)
case BPF_SKB_CLOCK_TAI:
if (!tstamp)
return -EINVAL;
skb->tstamp = 0;
skb->mono_delivery_time = 0;
skb->tstamp = tstamp;
skb->tstamp_type = SKB_CLOCK_TAI;
break;
default:
return -EINVAL;
@ -9387,16 +9391,17 @@ static struct bpf_insn *bpf_convert_tstamp_type_read(const struct bpf_insn *si,
{
__u8 value_reg = si->dst_reg;
__u8 skb_reg = si->src_reg;
/* AX is needed because src_reg and dst_reg could be the same */
__u8 tmp_reg = BPF_REG_AX;
*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg,
SKB_BF_MONO_TC_OFFSET);
*insn++ = BPF_JMP32_IMM(BPF_JSET, tmp_reg,
SKB_MONO_DELIVERY_TIME_MASK, 2);
*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_TSTAMP_UNSPEC);
*insn++ = BPF_JMP_A(1);
*insn++ = BPF_MOV32_IMM(value_reg, BPF_SKB_TSTAMP_DELIVERY_MONO);
BUILD_BUG_ON(__SKB_CLOCK_MAX != (int)BPF_SKB_CLOCK_TAI);
BUILD_BUG_ON(SKB_CLOCK_REALTIME != (int)BPF_SKB_CLOCK_REALTIME);
BUILD_BUG_ON(SKB_CLOCK_MONOTONIC != (int)BPF_SKB_CLOCK_MONOTONIC);
BUILD_BUG_ON(SKB_CLOCK_TAI != (int)BPF_SKB_CLOCK_TAI);
*insn++ = BPF_LDX_MEM(BPF_B, value_reg, skb_reg, SKB_BF_MONO_TC_OFFSET);
*insn++ = BPF_ALU32_IMM(BPF_AND, value_reg, SKB_TSTAMP_TYPE_MASK);
#ifdef __BIG_ENDIAN_BITFIELD
*insn++ = BPF_ALU32_IMM(BPF_RSH, value_reg, SKB_TSTAMP_TYPE_RSHIFT);
#else
BUILD_BUG_ON(!(SKB_TSTAMP_TYPE_MASK & 0x1));
#endif
return insn;
}
@ -9439,11 +9444,12 @@ static struct bpf_insn *bpf_convert_tstamp_read(const struct bpf_prog *prog,
__u8 tmp_reg = BPF_REG_AX;
*insn++ = BPF_LDX_MEM(BPF_B, tmp_reg, skb_reg, SKB_BF_MONO_TC_OFFSET);
*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK);
*insn++ = BPF_JMP32_IMM(BPF_JNE, tmp_reg,
TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK, 2);
/* skb->tc_at_ingress && skb->mono_delivery_time,
/* check if ingress mask bits is set */
*insn++ = BPF_JMP32_IMM(BPF_JSET, tmp_reg, TC_AT_INGRESS_MASK, 1);
*insn++ = BPF_JMP_A(4);
*insn++ = BPF_JMP32_IMM(BPF_JSET, tmp_reg, SKB_TSTAMP_TYPE_MASK, 1);
*insn++ = BPF_JMP_A(2);
/* skb->tc_at_ingress && skb->tstamp_type,
* read 0 as the (rcv) timestamp.
*/
*insn++ = BPF_MOV64_IMM(value_reg, 0);
@ -9468,7 +9474,7 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
* the bpf prog is aware the tstamp could have delivery time.
* Thus, write skb->tstamp as is if tstamp_type_access is true.
* Otherwise, writing at ingress will have to clear the
* mono_delivery_time bit also.
* skb->tstamp_type bit also.
*/
if (!prog->tstamp_type_access) {
__u8 tmp_reg = BPF_REG_AX;
@ -9478,8 +9484,8 @@ static struct bpf_insn *bpf_convert_tstamp_write(const struct bpf_prog *prog,
*insn++ = BPF_JMP32_IMM(BPF_JSET, tmp_reg, TC_AT_INGRESS_MASK, 1);
/* goto <store> */
*insn++ = BPF_JMP_A(2);
/* <clear>: mono_delivery_time */
*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, ~SKB_MONO_DELIVERY_TIME_MASK);
/* <clear>: skb->tstamp_type */
*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg, ~SKB_TSTAMP_TYPE_MASK);
*insn++ = BPF_STX_MEM(BPF_B, skb_reg, tmp_reg, SKB_BF_MONO_TC_OFFSET);
}
#endif

View File

@ -130,7 +130,7 @@ static int lowpan_frag_queue(struct lowpan_frag_queue *fq,
goto err;
fq->q.stamp = skb->tstamp;
fq->q.mono_delivery_time = skb->mono_delivery_time;
fq->q.tstamp_type = skb->tstamp_type;
if (frag_type == LOWPAN_DISPATCH_FRAG1)
fq->q.flags |= INET_FRAG_FIRST_IN;

View File

@ -619,7 +619,7 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
skb_mark_not_on_list(head);
head->prev = NULL;
head->tstamp = q->stamp;
head->mono_delivery_time = q->mono_delivery_time;
head->tstamp_type = q->tstamp_type;
if (sk)
refcount_add(sum_truesize - head_truesize, &sk->sk_wmem_alloc);

View File

@ -355,7 +355,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
qp->iif = dev->ifindex;
qp->q.stamp = skb->tstamp;
qp->q.mono_delivery_time = skb->mono_delivery_time;
qp->q.tstamp_type = skb->tstamp_type;
qp->q.meat += skb->len;
qp->ecn |= ecn;
add_frag_mem_limit(qp->q.fqdir, skb->truesize);

View File

@ -764,7 +764,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
{
struct iphdr *iph;
struct sk_buff *skb2;
bool mono_delivery_time = skb->mono_delivery_time;
u8 tstamp_type = skb->tstamp_type;
struct rtable *rt = skb_rtable(skb);
unsigned int mtu, hlen, ll_rs;
struct ip_fraglist_iter iter;
@ -856,7 +856,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
}
}
skb_set_delivery_time(skb, tstamp, mono_delivery_time);
skb_set_delivery_time(skb, tstamp, tstamp_type);
err = output(net, sk, skb);
if (!err)
@ -912,7 +912,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
/*
* Put this fragment into the sending queue.
*/
skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
skb_set_delivery_time(skb2, tstamp, tstamp_type);
err = output(net, sk, skb2);
if (err)
goto fail;
@ -1457,7 +1457,10 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
skb->priority = (cork->tos != -1) ? cork->priority: READ_ONCE(sk->sk_priority);
skb->mark = cork->mark;
skb->tstamp = cork->transmit_time;
if (sk_is_tcp(sk))
skb_set_delivery_time(skb, cork->transmit_time, SKB_CLOCK_MONOTONIC);
else
skb_set_delivery_type_by_clockid(skb, cork->transmit_time, sk->sk_clockid);
/*
* Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
* on dst refcount
@ -1649,7 +1652,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
arg->csumoffset) = csum_fold(csum_add(nskb->csum,
arg->csum));
nskb->ip_summed = CHECKSUM_NONE;
nskb->mono_delivery_time = !!transmit_time;
if (transmit_time)
nskb->tstamp_type = SKB_CLOCK_MONOTONIC;
if (txhash)
skb_set_hash(nskb, txhash, PKT_HASH_TYPE_L4);
ip_push_pending_frames(sk, &fl4);

View File

@ -360,7 +360,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
skb->protocol = htons(ETH_P_IP);
skb->priority = READ_ONCE(sk->sk_priority);
skb->mark = sockc->mark;
skb->tstamp = sockc->transmit_time;
skb_set_delivery_type_by_clockid(skb, sockc->transmit_time, sk->sk_clockid);
skb_dst_set(skb, &rt->dst);
*rtp = NULL;

View File

@ -3625,6 +3625,8 @@ void __init tcp_v4_init(void)
*/
inet_sk(sk)->pmtudisc = IP_PMTUDISC_DO;
sk->sk_clockid = CLOCK_MONOTONIC;
per_cpu(ipv4_tcp_sk, cpu) = sk;
}
if (register_pernet_subsys(&tcp_sk_ops))

View File

@ -1301,7 +1301,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
tp = tcp_sk(sk);
prior_wstamp = tp->tcp_wstamp_ns;
tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
skb_set_delivery_time(skb, tp->tcp_wstamp_ns, SKB_CLOCK_MONOTONIC);
if (clone_it) {
oskb = skb;
@ -1655,7 +1655,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
skb_split(skb, buff, len);
skb_set_delivery_time(buff, skb->tstamp, true);
skb_set_delivery_time(buff, skb->tstamp, SKB_CLOCK_MONOTONIC);
tcp_fragment_tstamp(skb, buff);
old_factor = tcp_skb_pcount(skb);
@ -2764,7 +2764,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
tp->tcp_wstamp_ns = tp->tcp_clock_cache;
skb_set_delivery_time(skb, tp->tcp_wstamp_ns, true);
skb_set_delivery_time(skb, tp->tcp_wstamp_ns, SKB_CLOCK_MONOTONIC);
list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
tcp_init_tso_segs(skb, mss_now);
goto repair; /* Skip network transmission */
@ -3752,11 +3752,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
#ifdef CONFIG_SYN_COOKIES
if (unlikely(synack_type == TCP_SYNACK_COOKIE && ireq->tstamp_ok))
skb_set_delivery_time(skb, cookie_init_timestamp(req, now),
true);
SKB_CLOCK_MONOTONIC);
else
#endif
{
skb_set_delivery_time(skb, now, true);
skb_set_delivery_time(skb, now, SKB_CLOCK_MONOTONIC);
if (!tcp_rsk(req)->snt_synack) /* Timestamp first SYNACK */
tcp_rsk(req)->snt_synack = tcp_skb_timestamp_us(skb);
}
@ -3843,7 +3843,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
synack_type, &opts);
skb_set_delivery_time(skb, now, true);
skb_set_delivery_time(skb, now, SKB_CLOCK_MONOTONIC);
tcp_add_tx_delay(skb, tp);
return skb;
@ -4027,7 +4027,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, true);
skb_set_delivery_time(syn, syn_data->skb_mstamp_ns, SKB_CLOCK_MONOTONIC);
/* Now full SYN+DATA was cloned and sent (or not),
* remove the SYN from the original skb (syn_data)

View File

@ -859,7 +859,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
struct rt6_info *rt = dst_rt6_info(skb_dst(skb));
struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
inet6_sk(skb->sk) : NULL;
bool mono_delivery_time = skb->mono_delivery_time;
u8 tstamp_type = skb->tstamp_type;
struct ip6_frag_state state;
unsigned int mtu, hlen, nexthdr_offset;
ktime_t tstamp = skb->tstamp;
@ -955,7 +955,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
if (iter.frag)
ip6_fraglist_prepare(skb, &iter);
skb_set_delivery_time(skb, tstamp, mono_delivery_time);
skb_set_delivery_time(skb, tstamp, tstamp_type);
err = output(net, sk, skb);
if (!err)
IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
@ -1016,7 +1016,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
/*
* Put this fragment into the sending queue.
*/
skb_set_delivery_time(frag, tstamp, mono_delivery_time);
skb_set_delivery_time(frag, tstamp, tstamp_type);
err = output(net, sk, frag);
if (err)
goto fail;
@ -1924,7 +1924,10 @@ struct sk_buff *__ip6_make_skb(struct sock *sk,
skb->priority = READ_ONCE(sk->sk_priority);
skb->mark = cork->base.mark;
skb->tstamp = cork->base.transmit_time;
if (sk_is_tcp(sk))
skb_set_delivery_time(skb, cork->base.transmit_time, SKB_CLOCK_MONOTONIC);
else
skb_set_delivery_type_by_clockid(skb, cork->base.transmit_time, sk->sk_clockid);
ip6_cork_steal_dst(skb, cork);
IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTREQUESTS);

View File

@ -126,7 +126,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
struct sk_buff *))
{
int frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
bool mono_delivery_time = skb->mono_delivery_time;
u8 tstamp_type = skb->tstamp_type;
ktime_t tstamp = skb->tstamp;
struct ip6_frag_state state;
u8 *prevhdr, nexthdr = 0;
@ -192,7 +192,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
if (iter.frag)
ip6_fraglist_prepare(skb, &iter);
skb_set_delivery_time(skb, tstamp, mono_delivery_time);
skb_set_delivery_time(skb, tstamp, tstamp_type);
err = output(net, sk, data, skb);
if (err || !iter.frag)
break;
@ -225,7 +225,7 @@ int br_ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
goto blackhole;
}
skb_set_delivery_time(skb2, tstamp, mono_delivery_time);
skb_set_delivery_time(skb2, tstamp, tstamp_type);
err = output(net, sk, data, skb2);
if (err)
goto blackhole;

View File

@ -263,7 +263,7 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->iif = dev->ifindex;
fq->q.stamp = skb->tstamp;
fq->q.mono_delivery_time = skb->mono_delivery_time;
fq->q.tstamp_type = skb->tstamp_type;
fq->q.meat += skb->len;
fq->ecn |= ecn;
if (payload_len > fq->q.max_size)

View File

@ -621,7 +621,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
skb->protocol = htons(ETH_P_IPV6);
skb->priority = READ_ONCE(sk->sk_priority);
skb->mark = sockc->mark;
skb->tstamp = sockc->transmit_time;
skb_set_delivery_type_by_clockid(skb, sockc->transmit_time, sk->sk_clockid);
skb_put(skb, length);
skb_reset_network_header(skb);

View File

@ -198,7 +198,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
fq->iif = dev->ifindex;
fq->q.stamp = skb->tstamp;
fq->q.mono_delivery_time = skb->mono_delivery_time;
fq->q.tstamp_type = skb->tstamp_type;
fq->q.meat += skb->len;
fq->ecn |= ecn;
add_frag_mem_limit(fq->q.fqdir, skb->truesize);

View File

@ -975,7 +975,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
mark = inet_twsk(sk)->tw_mark;
else
mark = READ_ONCE(sk->sk_mark);
skb_set_delivery_time(buff, tcp_transmit_time(sk), true);
skb_set_delivery_time(buff, tcp_transmit_time(sk), SKB_CLOCK_MONOTONIC);
}
if (txhash) {
/* autoflowlabel/skb_get_hash_flowi6 rely on buff->hash */
@ -2387,8 +2387,14 @@ static struct inet_protosw tcpv6_protosw = {
static int __net_init tcpv6_net_init(struct net *net)
{
return inet_ctl_sock_create(&net->ipv6.tcp_sk, PF_INET6,
SOCK_RAW, IPPROTO_TCP, net);
int res;
res = inet_ctl_sock_create(&net->ipv6.tcp_sk, PF_INET6,
SOCK_RAW, IPPROTO_TCP, net);
if (!res)
net->ipv6.tcp_sk->sk_clockid = CLOCK_MONOTONIC;
return res;
}
static void __net_exit tcpv6_net_exit(struct net *net)

View File

@ -32,7 +32,9 @@
* -EINVAL - Passed NULL for bpf_tuple pointer
* -EINVAL - opts->reserved is not 0
* -EINVAL - netns_id is less than -1
* -EINVAL - opts__sz isn't NF_BPF_CT_OPTS_SZ (12)
* -EINVAL - opts__sz isn't NF_BPF_CT_OPTS_SZ (16) or 12
* -EINVAL - opts->ct_zone_id set when
opts__sz isn't NF_BPF_CT_OPTS_SZ (16)
* -EPROTO - l4proto isn't one of IPPROTO_TCP or IPPROTO_UDP
* -ENONET - No network namespace found for netns_id
* -ENOENT - Conntrack lookup could not find entry for tuple
@ -42,6 +44,8 @@
* Values:
* IPPROTO_TCP, IPPROTO_UDP
* @dir: - connection tracking tuple direction.
* @ct_zone_id - connection tracking zone id.
* @ct_zone_dir - connection tracking zone direction.
* @reserved - Reserved member, will be reused for more options in future
* Values:
* 0
@ -51,11 +55,13 @@ struct bpf_ct_opts {
s32 error;
u8 l4proto;
u8 dir;
u8 reserved[2];
u16 ct_zone_id;
u8 ct_zone_dir;
u8 reserved[3];
};
enum {
NF_BPF_CT_OPTS_SZ = 12,
NF_BPF_CT_OPTS_SZ = 16,
};
static int bpf_nf_ct_tuple_parse(struct bpf_sock_tuple *bpf_tuple,
@ -104,12 +110,21 @@ __bpf_nf_ct_alloc_entry(struct net *net, struct bpf_sock_tuple *bpf_tuple,
u32 timeout)
{
struct nf_conntrack_tuple otuple, rtuple;
struct nf_conntrack_zone ct_zone;
struct nf_conn *ct;
int err;
if (!opts || !bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
opts_len != NF_BPF_CT_OPTS_SZ)
if (!opts || !bpf_tuple)
return ERR_PTR(-EINVAL);
if (!(opts_len == NF_BPF_CT_OPTS_SZ || opts_len == 12))
return ERR_PTR(-EINVAL);
if (opts_len == NF_BPF_CT_OPTS_SZ) {
if (opts->reserved[0] || opts->reserved[1] || opts->reserved[2])
return ERR_PTR(-EINVAL);
} else {
if (opts->ct_zone_id)
return ERR_PTR(-EINVAL);
}
if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS))
return ERR_PTR(-EINVAL);
@ -130,7 +145,16 @@ __bpf_nf_ct_alloc_entry(struct net *net, struct bpf_sock_tuple *bpf_tuple,
return ERR_PTR(-ENONET);
}
ct = nf_conntrack_alloc(net, &nf_ct_zone_dflt, &otuple, &rtuple,
if (opts_len == NF_BPF_CT_OPTS_SZ) {
if (opts->ct_zone_dir == 0)
opts->ct_zone_dir = NF_CT_DEFAULT_ZONE_DIR;
nf_ct_zone_init(&ct_zone,
opts->ct_zone_id, opts->ct_zone_dir, 0);
} else {
ct_zone = nf_ct_zone_dflt;
}
ct = nf_conntrack_alloc(net, &ct_zone, &otuple, &rtuple,
GFP_ATOMIC);
if (IS_ERR(ct))
goto out;
@ -152,12 +176,21 @@ static struct nf_conn *__bpf_nf_ct_lookup(struct net *net,
{
struct nf_conntrack_tuple_hash *hash;
struct nf_conntrack_tuple tuple;
struct nf_conntrack_zone ct_zone;
struct nf_conn *ct;
int err;
if (!opts || !bpf_tuple || opts->reserved[0] || opts->reserved[1] ||
opts_len != NF_BPF_CT_OPTS_SZ)
if (!opts || !bpf_tuple)
return ERR_PTR(-EINVAL);
if (!(opts_len == NF_BPF_CT_OPTS_SZ || opts_len == 12))
return ERR_PTR(-EINVAL);
if (opts_len == NF_BPF_CT_OPTS_SZ) {
if (opts->reserved[0] || opts->reserved[1] || opts->reserved[2])
return ERR_PTR(-EINVAL);
} else {
if (opts->ct_zone_id)
return ERR_PTR(-EINVAL);
}
if (unlikely(opts->l4proto != IPPROTO_TCP && opts->l4proto != IPPROTO_UDP))
return ERR_PTR(-EPROTO);
if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS))
@ -174,7 +207,16 @@ static struct nf_conn *__bpf_nf_ct_lookup(struct net *net,
return ERR_PTR(-ENONET);
}
hash = nf_conntrack_find_get(net, &nf_ct_zone_dflt, &tuple);
if (opts_len == NF_BPF_CT_OPTS_SZ) {
if (opts->ct_zone_dir == 0)
opts->ct_zone_dir = NF_CT_DEFAULT_ZONE_DIR;
nf_ct_zone_init(&ct_zone,
opts->ct_zone_id, opts->ct_zone_dir, 0);
} else {
ct_zone = nf_ct_zone_dflt;
}
hash = nf_conntrack_find_get(net, &ct_zone, &tuple);
if (opts->netns_id >= 0)
put_net(net);
if (!hash)
@ -245,7 +287,7 @@ __bpf_kfunc_start_defs();
* @opts - Additional options for allocation (documented above)
* Cannot be NULL
* @opts__sz - Length of the bpf_ct_opts structure
* Must be NF_BPF_CT_OPTS_SZ (12)
* Must be NF_BPF_CT_OPTS_SZ (16) or 12
*/
__bpf_kfunc struct nf_conn___init *
bpf_xdp_ct_alloc(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
@ -279,7 +321,7 @@ bpf_xdp_ct_alloc(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
* @opts - Additional options for lookup (documented above)
* Cannot be NULL
* @opts__sz - Length of the bpf_ct_opts structure
* Must be NF_BPF_CT_OPTS_SZ (12)
* Must be NF_BPF_CT_OPTS_SZ (16) or 12
*/
__bpf_kfunc struct nf_conn *
bpf_xdp_ct_lookup(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
@ -312,7 +354,7 @@ bpf_xdp_ct_lookup(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
* @opts - Additional options for allocation (documented above)
* Cannot be NULL
* @opts__sz - Length of the bpf_ct_opts structure
* Must be NF_BPF_CT_OPTS_SZ (12)
* Must be NF_BPF_CT_OPTS_SZ (16) or 12
*/
__bpf_kfunc struct nf_conn___init *
bpf_skb_ct_alloc(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple,
@ -347,7 +389,7 @@ bpf_skb_ct_alloc(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple,
* @opts - Additional options for lookup (documented above)
* Cannot be NULL
* @opts__sz - Length of the bpf_ct_opts structure
* Must be NF_BPF_CT_OPTS_SZ (12)
* Must be NF_BPF_CT_OPTS_SZ (16) or 12
*/
__bpf_kfunc struct nf_conn *
bpf_skb_ct_lookup(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple,

View File

@ -2056,8 +2056,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
skb->dev = dev;
skb->priority = READ_ONCE(sk->sk_priority);
skb->mark = READ_ONCE(sk->sk_mark);
skb->tstamp = sockc.transmit_time;
skb_set_delivery_type_by_clockid(skb, sockc.transmit_time, sk->sk_clockid);
skb_setup_tx_timestamp(skb, sockc.tsflags);
if (unlikely(extra_len == 4))
@ -2584,7 +2583,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
skb->dev = dev;
skb->priority = READ_ONCE(po->sk.sk_priority);
skb->mark = READ_ONCE(po->sk.sk_mark);
skb->tstamp = sockc->transmit_time;
skb_set_delivery_type_by_clockid(skb, sockc->transmit_time, po->sk.sk_clockid);
skb_setup_tx_timestamp(skb, sockc->tsflags);
skb_zcopy_set_nouarg(skb, ph.raw);
@ -3062,7 +3061,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->dev = dev;
skb->priority = READ_ONCE(sk->sk_priority);
skb->mark = sockc.mark;
skb->tstamp = sockc.transmit_time;
skb_set_delivery_type_by_clockid(skb, sockc.transmit_time, sk->sk_clockid);
if (unlikely(extra_len == 4))
skb->no_fcs = 1;

View File

@ -54,8 +54,8 @@ TC_INDIRECT_SCOPE int tcf_bpf_act(struct sk_buff *skb,
bpf_compute_data_pointers(skb);
filter_res = bpf_prog_run(filter, skb);
}
if (unlikely(!skb->tstamp && skb->mono_delivery_time))
skb->mono_delivery_time = 0;
if (unlikely(!skb->tstamp && skb->tstamp_type))
skb->tstamp_type = SKB_CLOCK_REALTIME;
if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK)
skb_orphan(skb);

View File

@ -104,8 +104,8 @@ TC_INDIRECT_SCOPE int cls_bpf_classify(struct sk_buff *skb,
bpf_compute_data_pointers(skb);
filter_res = bpf_prog_run(prog->filter, skb);
}
if (unlikely(!skb->tstamp && skb->mono_delivery_time))
skb->mono_delivery_time = 0;
if (unlikely(!skb->tstamp && skb->tstamp_type))
skb->tstamp_type = SKB_CLOCK_REALTIME;
if (prog->exts_integrated) {
res->class = 0;

View File

@ -211,7 +211,7 @@ int bpf_prog1(struct cpu_args *ctx)
SEC("tracepoint/power/cpu_frequency")
int bpf_prog2(struct cpu_args *ctx)
{
u64 *pts, *cstate, *pstate, prev_state, cur_ts, delta;
u64 *pts, *cstate, *pstate, cur_ts, delta;
u32 key, cpu, pstate_idx;
u64 *val;
@ -232,7 +232,6 @@ int bpf_prog2(struct cpu_args *ctx)
if (!cstate)
return 0;
prev_state = *pstate;
*pstate = ctx->state;
if (!*pts) {

View File

@ -14,9 +14,7 @@ pahole-flags-$(call test-ge, $(pahole-ver), 121) += --btf_gen_floats
pahole-flags-$(call test-ge, $(pahole-ver), 122) += -j
ifeq ($(pahole-ver), 125)
pahole-flags-y += --skip_encoding_btf_inconsistent_proto --btf_gen_optimized
endif
pahole-flags-$(call test-ge, $(pahole-ver), 125) += --skip_encoding_btf_inconsistent_proto --btf_gen_optimized
else

View File

@ -28,7 +28,7 @@ BTF COMMANDS
| **bpftool** **btf help**
|
| *BTF_SRC* := { **id** *BTF_ID* | **prog** *PROG* | **map** *MAP* [{**key** | **value** | **kv** | **all**}] | **file** *FILE* }
| *FORMAT* := { **raw** | **c** }
| *FORMAT* := { **raw** | **c** [**unsorted**] }
| *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
| *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* | **name** *PROG_NAME* }
@ -63,7 +63,9 @@ bpftool btf dump *BTF_SRC*
pahole.
**format** option can be used to override default (raw) output format. Raw
(**raw**) or C-syntax (**c**) output formats are supported.
(**raw**) or C-syntax (**c**) output formats are supported. With C-style
formatting, the output is sorted by default. Use the **unsorted** option
to avoid sorting the output.
bpftool btf help
Print short help message.

View File

@ -204,10 +204,11 @@ ifeq ($(feature-clang-bpf-co-re),1)
BUILD_BPF_SKELS := 1
$(OUTPUT)vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL_BOOTSTRAP)
ifeq ($(VMLINUX_H),)
$(OUTPUT)vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL_BOOTSTRAP)
$(QUIET_GEN)$(BPFTOOL_BOOTSTRAP) btf dump file $< format c > $@
else
$(OUTPUT)vmlinux.h: $(VMLINUX_H)
$(Q)cp "$(VMLINUX_H)" $@
endif

View File

@ -930,6 +930,9 @@ _bpftool()
format)
COMPREPLY=( $( compgen -W "c raw" -- "$cur" ) )
;;
c)
COMPREPLY=( $( compgen -W "unsorted" -- "$cur" ) )
;;
*)
# emit extra options
case ${words[3]} in

View File

@ -43,6 +43,13 @@ static const char * const btf_kind_str[NR_BTF_KINDS] = {
[BTF_KIND_ENUM64] = "ENUM64",
};
struct sort_datum {
int index;
int type_rank;
const char *sort_name;
const char *own_name;
};
static const char *btf_int_enc_str(__u8 encoding)
{
switch (encoding) {
@ -460,9 +467,122 @@ static void __printf(2, 0) btf_dump_printf(void *ctx,
vfprintf(stdout, fmt, args);
}
static int dump_btf_c(const struct btf *btf,
__u32 *root_type_ids, int root_type_cnt)
static int btf_type_rank(const struct btf *btf, __u32 index, bool has_name)
{
const struct btf_type *t = btf__type_by_id(btf, index);
const int kind = btf_kind(t);
const int max_rank = 10;
if (t->name_off)
has_name = true;
switch (kind) {
case BTF_KIND_ENUM:
case BTF_KIND_ENUM64:
return has_name ? 1 : 0;
case BTF_KIND_INT:
case BTF_KIND_FLOAT:
return 2;
case BTF_KIND_STRUCT:
case BTF_KIND_UNION:
return has_name ? 3 : max_rank;
case BTF_KIND_FUNC_PROTO:
return has_name ? 4 : max_rank;
case BTF_KIND_ARRAY:
if (has_name)
return btf_type_rank(btf, btf_array(t)->type, has_name);
return max_rank;
case BTF_KIND_TYPE_TAG:
case BTF_KIND_CONST:
case BTF_KIND_PTR:
case BTF_KIND_VOLATILE:
case BTF_KIND_RESTRICT:
case BTF_KIND_TYPEDEF:
case BTF_KIND_DECL_TAG:
if (has_name)
return btf_type_rank(btf, t->type, has_name);
return max_rank;
default:
return max_rank;
}
}
static const char *btf_type_sort_name(const struct btf *btf, __u32 index, bool from_ref)
{
const struct btf_type *t = btf__type_by_id(btf, index);
switch (btf_kind(t)) {
case BTF_KIND_ENUM:
case BTF_KIND_ENUM64: {
int name_off = t->name_off;
/* Use name of the first element for anonymous enums if allowed */
if (!from_ref && !t->name_off && btf_vlen(t))
name_off = btf_enum(t)->name_off;
return btf__name_by_offset(btf, name_off);
}
case BTF_KIND_ARRAY:
return btf_type_sort_name(btf, btf_array(t)->type, true);
case BTF_KIND_TYPE_TAG:
case BTF_KIND_CONST:
case BTF_KIND_PTR:
case BTF_KIND_VOLATILE:
case BTF_KIND_RESTRICT:
case BTF_KIND_TYPEDEF:
case BTF_KIND_DECL_TAG:
return btf_type_sort_name(btf, t->type, true);
default:
return btf__name_by_offset(btf, t->name_off);
}
return NULL;
}
static int btf_type_compare(const void *left, const void *right)
{
const struct sort_datum *d1 = (const struct sort_datum *)left;
const struct sort_datum *d2 = (const struct sort_datum *)right;
int r;
if (d1->type_rank != d2->type_rank)
return d1->type_rank < d2->type_rank ? -1 : 1;
r = strcmp(d1->sort_name, d2->sort_name);
if (r)
return r;
return strcmp(d1->own_name, d2->own_name);
}
static struct sort_datum *sort_btf_c(const struct btf *btf)
{
struct sort_datum *datums;
int n;
n = btf__type_cnt(btf);
datums = malloc(sizeof(struct sort_datum) * n);
if (!datums)
return NULL;
for (int i = 0; i < n; ++i) {
struct sort_datum *d = datums + i;
const struct btf_type *t = btf__type_by_id(btf, i);
d->index = i;
d->type_rank = btf_type_rank(btf, i, false);
d->sort_name = btf_type_sort_name(btf, i, false);
d->own_name = btf__name_by_offset(btf, t->name_off);
}
qsort(datums, n, sizeof(struct sort_datum), btf_type_compare);
return datums;
}
static int dump_btf_c(const struct btf *btf,
__u32 *root_type_ids, int root_type_cnt, bool sort_dump)
{
struct sort_datum *datums = NULL;
struct btf_dump *d;
int err = 0, i;
@ -486,8 +606,12 @@ static int dump_btf_c(const struct btf *btf,
} else {
int cnt = btf__type_cnt(btf);
if (sort_dump)
datums = sort_btf_c(btf);
for (i = 1; i < cnt; i++) {
err = btf_dump__dump_type(d, i);
int idx = datums ? datums[i].index : i;
err = btf_dump__dump_type(d, idx);
if (err)
goto done;
}
@ -500,6 +624,7 @@ static int dump_btf_c(const struct btf *btf,
printf("#endif /* __VMLINUX_H__ */\n");
done:
free(datums);
btf_dump__free(d);
return err;
}
@ -549,10 +674,10 @@ static bool btf_is_kernel_module(__u32 btf_id)
static int do_dump(int argc, char **argv)
{
bool dump_c = false, sort_dump_c = true;
struct btf *btf = NULL, *base = NULL;
__u32 root_type_ids[2];
int root_type_cnt = 0;
bool dump_c = false;
__u32 btf_id = -1;
const char *src;
int fd = -1;
@ -663,6 +788,9 @@ static int do_dump(int argc, char **argv)
goto done;
}
NEXT_ARG();
} else if (is_prefix(*argv, "unsorted")) {
sort_dump_c = false;
NEXT_ARG();
} else {
p_err("unrecognized option: '%s'", *argv);
err = -EINVAL;
@ -691,7 +819,7 @@ static int do_dump(int argc, char **argv)
err = -ENOTSUP;
goto done;
}
err = dump_btf_c(btf, root_type_ids, root_type_cnt);
err = dump_btf_c(btf, root_type_ids, root_type_cnt, sort_dump_c);
} else {
err = dump_btf_raw(btf, root_type_ids, root_type_cnt);
}
@ -1063,7 +1191,7 @@ static int do_help(int argc, char **argv)
" %1$s %2$s help\n"
"\n"
" BTF_SRC := { id BTF_ID | prog PROG | map MAP [{key | value | kv | all}] | file FILE }\n"
" FORMAT := { raw | c }\n"
" FORMAT := { raw | c [unsorted] }\n"
" " HELP_SPEC_MAP "\n"
" " HELP_SPEC_PROGRAM "\n"
" " HELP_SPEC_OPTIONS " |\n"

View File

@ -410,7 +410,7 @@ void get_prog_full_name(const struct bpf_prog_info *prog_info, int prog_fd,
{
const char *prog_name = prog_info->name;
const struct btf_type *func_type;
const struct bpf_func_info finfo = {};
struct bpf_func_info finfo = {};
struct bpf_prog_info info = {};
__u32 info_len = sizeof(info);
struct btf *prog_btf = NULL;

View File

@ -6207,12 +6207,17 @@ union { \
__u64 :64; \
} __attribute__((aligned(8)))
/* The enum used in skb->tstamp_type. It specifies the clock type
* of the time stored in the skb->tstamp.
*/
enum {
BPF_SKB_TSTAMP_UNSPEC,
BPF_SKB_TSTAMP_DELIVERY_MONO, /* tstamp has mono delivery time */
/* For any BPF_SKB_TSTAMP_* that the bpf prog cannot handle,
* the bpf prog should handle it like BPF_SKB_TSTAMP_UNSPEC
* and try to deduce it by ingress, egress or skb->sk->sk_clockid.
BPF_SKB_TSTAMP_UNSPEC = 0, /* DEPRECATED */
BPF_SKB_TSTAMP_DELIVERY_MONO = 1, /* DEPRECATED */
BPF_SKB_CLOCK_REALTIME = 0,
BPF_SKB_CLOCK_MONOTONIC = 1,
BPF_SKB_CLOCK_TAI = 2,
/* For any future BPF_SKB_CLOCK_* that the bpf prog cannot handle,
* the bpf prog can try to deduce it by ingress/egress/skb->sk->sk_clockid.
*/
};

View File

@ -80,6 +80,7 @@ CONFIG_NETFILTER_XT_TARGET_CT=y
CONFIG_NETKIT=y
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_ZONES=y
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_DEFRAG_IPV6=y
CONFIG_NF_NAT=y

View File

@ -104,6 +104,7 @@ static void test_bpf_nf_ct(int mode)
ASSERT_EQ(skel->bss->test_einval_bpf_tuple, -EINVAL, "Test EINVAL for NULL bpf_tuple");
ASSERT_EQ(skel->bss->test_einval_reserved, -EINVAL, "Test EINVAL for reserved not set to 0");
ASSERT_EQ(skel->bss->test_einval_reserved_new, -EINVAL, "Test EINVAL for reserved in new struct not set to 0");
ASSERT_EQ(skel->bss->test_einval_netns_id, -EINVAL, "Test EINVAL for netns_id < -1");
ASSERT_EQ(skel->bss->test_einval_len_opts, -EINVAL, "Test EINVAL for len__opts != NF_BPF_CT_OPTS_SZ");
ASSERT_EQ(skel->bss->test_eproto_l4proto, -EPROTO, "Test EPROTO for l4proto != TCP or UDP");
@ -122,6 +123,12 @@ static void test_bpf_nf_ct(int mode)
ASSERT_EQ(skel->bss->test_exist_lookup_mark, 43, "Test existing connection lookup ctmark");
ASSERT_EQ(skel->data->test_snat_addr, 0, "Test for source natting");
ASSERT_EQ(skel->data->test_dnat_addr, 0, "Test for destination natting");
ASSERT_EQ(skel->data->test_ct_zone_id_alloc_entry, 0, "Test for alloc new entry in specified ct zone");
ASSERT_EQ(skel->data->test_ct_zone_id_insert_entry, 0, "Test for insert new entry in specified ct zone");
ASSERT_EQ(skel->data->test_ct_zone_id_succ_lookup, 0, "Test for successful lookup in specified ct_zone");
ASSERT_EQ(skel->bss->test_ct_zone_dir_enoent_lookup, -ENOENT, "Test ENOENT for lookup with wrong ct zone dir");
ASSERT_EQ(skel->bss->test_ct_zone_id_enoent_lookup, -ENOENT, "Test ENOENT for lookup in wrong ct zone");
end:
if (client_fd != -1)
close(client_fd);

View File

@ -69,15 +69,17 @@ static struct test_case test_cases[] = {
{
N(SCHED_CLS, struct __sk_buff, tstamp),
.read = "r11 = *(u8 *)($ctx + sk_buff::__mono_tc_offset);"
"w11 &= 3;"
"if w11 != 0x3 goto pc+2;"
"if w11 & 0x4 goto pc+1;"
"goto pc+4;"
"if w11 & 0x3 goto pc+1;"
"goto pc+2;"
"$dst = 0;"
"goto pc+1;"
"$dst = *(u64 *)($ctx + sk_buff::tstamp);",
.write = "r11 = *(u8 *)($ctx + sk_buff::__mono_tc_offset);"
"if w11 & 0x2 goto pc+1;"
"if w11 & 0x4 goto pc+1;"
"goto pc+2;"
"w11 &= -2;"
"w11 &= -4;"
"*(u8 *)($ctx + sk_buff::__mono_tc_offset) = r11;"
"*(u64 *)($ctx + sk_buff::tstamp) = $src;",
},

View File

@ -890,9 +890,6 @@ static void test_udp_dtime(struct test_tc_dtime *skel, int family, bool bpf_fwd)
ASSERT_EQ(dtimes[INGRESS_FWDNS_P100], 0,
dtime_cnt_str(t, INGRESS_FWDNS_P100));
/* non mono delivery time is not forwarded */
ASSERT_EQ(dtimes[INGRESS_FWDNS_P101], 0,
dtime_cnt_str(t, INGRESS_FWDNS_P101));
for (i = EGRESS_FWDNS_P100; i < SET_DTIME; i++)
ASSERT_GT(dtimes[i], 0, dtime_cnt_str(t, i));

View File

@ -9,10 +9,14 @@
#define EINVAL 22
#define ENOENT 2
#define NF_CT_ZONE_DIR_ORIG (1 << IP_CT_DIR_ORIGINAL)
#define NF_CT_ZONE_DIR_REPL (1 << IP_CT_DIR_REPLY)
extern unsigned long CONFIG_HZ __kconfig;
int test_einval_bpf_tuple = 0;
int test_einval_reserved = 0;
int test_einval_reserved_new = 0;
int test_einval_netns_id = 0;
int test_einval_len_opts = 0;
int test_eproto_l4proto = 0;
@ -22,6 +26,11 @@ int test_eafnosupport = 0;
int test_alloc_entry = -EINVAL;
int test_insert_entry = -EAFNOSUPPORT;
int test_succ_lookup = -ENOENT;
int test_ct_zone_id_alloc_entry = -EINVAL;
int test_ct_zone_id_insert_entry = -EAFNOSUPPORT;
int test_ct_zone_id_succ_lookup = -ENOENT;
int test_ct_zone_dir_enoent_lookup = 0;
int test_ct_zone_id_enoent_lookup = 0;
u32 test_delta_timeout = 0;
u32 test_status = 0;
u32 test_insert_lookup_mark = 0;
@ -45,6 +54,17 @@ struct bpf_ct_opts___local {
s32 netns_id;
s32 error;
u8 l4proto;
u8 dir;
u8 reserved[2];
};
struct bpf_ct_opts___new {
s32 netns_id;
s32 error;
u8 l4proto;
u8 dir;
u16 ct_zone_id;
u8 ct_zone_dir;
u8 reserved[3];
} __attribute__((preserve_access_index));
@ -220,10 +240,97 @@ nf_ct_test(struct nf_conn *(*lookup_fn)(void *, struct bpf_sock_tuple *, u32,
}
}
static __always_inline void
nf_ct_opts_new_test(struct nf_conn *(*lookup_fn)(void *, struct bpf_sock_tuple *, u32,
struct bpf_ct_opts___new *, u32),
struct nf_conn *(*alloc_fn)(void *, struct bpf_sock_tuple *, u32,
struct bpf_ct_opts___new *, u32),
void *ctx)
{
struct bpf_ct_opts___new opts_def = { .l4proto = IPPROTO_TCP, .netns_id = -1 };
struct bpf_sock_tuple bpf_tuple;
struct nf_conn *ct;
__builtin_memset(&bpf_tuple, 0, sizeof(bpf_tuple.ipv4));
opts_def.reserved[0] = 1;
ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
sizeof(opts_def));
opts_def.reserved[0] = 0;
if (ct)
bpf_ct_release(ct);
else
test_einval_reserved_new = opts_def.error;
bpf_tuple.ipv4.saddr = bpf_get_prandom_u32(); /* src IP */
bpf_tuple.ipv4.daddr = bpf_get_prandom_u32(); /* dst IP */
bpf_tuple.ipv4.sport = bpf_get_prandom_u32(); /* src port */
bpf_tuple.ipv4.dport = bpf_get_prandom_u32(); /* dst port */
/* use non-default ct zone */
opts_def.ct_zone_id = 10;
opts_def.ct_zone_dir = NF_CT_ZONE_DIR_ORIG;
ct = alloc_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
sizeof(opts_def));
if (ct) {
__u16 sport = bpf_get_prandom_u32();
__u16 dport = bpf_get_prandom_u32();
union nf_inet_addr saddr = {};
union nf_inet_addr daddr = {};
struct nf_conn *ct_ins;
bpf_ct_set_timeout(ct, 10000);
/* snat */
saddr.ip = bpf_get_prandom_u32();
bpf_ct_set_nat_info(ct, &saddr, sport, NF_NAT_MANIP_SRC___local);
/* dnat */
daddr.ip = bpf_get_prandom_u32();
bpf_ct_set_nat_info(ct, &daddr, dport, NF_NAT_MANIP_DST___local);
ct_ins = bpf_ct_insert_entry(ct);
if (ct_ins) {
struct nf_conn *ct_lk;
/* entry should exist in same ct zone we inserted it */
ct_lk = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4),
&opts_def, sizeof(opts_def));
if (ct_lk) {
bpf_ct_release(ct_lk);
test_ct_zone_id_succ_lookup = 0;
}
/* entry should not exist with wrong direction */
opts_def.ct_zone_dir = NF_CT_ZONE_DIR_REPL;
ct_lk = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4),
&opts_def, sizeof(opts_def));
opts_def.ct_zone_dir = NF_CT_ZONE_DIR_ORIG;
if (ct_lk)
bpf_ct_release(ct_lk);
else
test_ct_zone_dir_enoent_lookup = opts_def.error;
/* entry should not exist in default ct zone */
opts_def.ct_zone_id = 0;
ct_lk = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4),
&opts_def, sizeof(opts_def));
if (ct_lk)
bpf_ct_release(ct_lk);
else
test_ct_zone_id_enoent_lookup = opts_def.error;
bpf_ct_release(ct_ins);
test_ct_zone_id_insert_entry = 0;
}
test_ct_zone_id_alloc_entry = 0;
}
}
SEC("xdp")
int nf_xdp_ct_test(struct xdp_md *ctx)
{
nf_ct_test((void *)bpf_xdp_ct_lookup, (void *)bpf_xdp_ct_alloc, ctx);
nf_ct_opts_new_test((void *)bpf_xdp_ct_lookup, (void *)bpf_xdp_ct_alloc, ctx);
return 0;
}
@ -231,6 +338,7 @@ SEC("tc")
int nf_skb_ct_test(struct __sk_buff *ctx)
{
nf_ct_test((void *)bpf_skb_ct_lookup, (void *)bpf_skb_ct_alloc, ctx);
nf_ct_opts_new_test((void *)bpf_skb_ct_lookup, (void *)bpf_skb_ct_alloc, ctx);
return 0;
}

View File

@ -222,16 +222,20 @@ int egress_host(struct __sk_buff *skb)
return TC_ACT_OK;
if (skb_proto(skb_type) == IPPROTO_TCP) {
if (skb->tstamp_type == BPF_SKB_TSTAMP_DELIVERY_MONO &&
if (skb->tstamp_type == BPF_SKB_CLOCK_MONOTONIC &&
skb->tstamp)
inc_dtimes(EGRESS_ENDHOST);
else
inc_errs(EGRESS_ENDHOST);
} else if (skb_proto(skb_type) == IPPROTO_UDP) {
if (skb->tstamp_type == BPF_SKB_CLOCK_TAI &&
skb->tstamp)
inc_dtimes(EGRESS_ENDHOST);
else
inc_errs(EGRESS_ENDHOST);
} else {
if (skb->tstamp_type == BPF_SKB_TSTAMP_UNSPEC &&
if (skb->tstamp_type == BPF_SKB_CLOCK_REALTIME &&
skb->tstamp)
inc_dtimes(EGRESS_ENDHOST);
else
inc_errs(EGRESS_ENDHOST);
}
@ -252,7 +256,7 @@ int ingress_host(struct __sk_buff *skb)
if (!skb_type)
return TC_ACT_OK;
if (skb->tstamp_type == BPF_SKB_TSTAMP_DELIVERY_MONO &&
if (skb->tstamp_type == BPF_SKB_CLOCK_MONOTONIC &&
skb->tstamp == EGRESS_FWDNS_MAGIC)
inc_dtimes(INGRESS_ENDHOST);
else
@ -315,7 +319,6 @@ int egress_fwdns_prio100(struct __sk_buff *skb)
SEC("tc")
int ingress_fwdns_prio101(struct __sk_buff *skb)
{
__u64 expected_dtime = EGRESS_ENDHOST_MAGIC;
int skb_type;
skb_type = skb_get_type(skb);
@ -323,29 +326,24 @@ int ingress_fwdns_prio101(struct __sk_buff *skb)
/* Should have handled in prio100 */
return TC_ACT_SHOT;
if (skb_proto(skb_type) == IPPROTO_UDP)
expected_dtime = 0;
if (skb->tstamp_type) {
if (fwdns_clear_dtime() ||
skb->tstamp_type != BPF_SKB_TSTAMP_DELIVERY_MONO ||
skb->tstamp != expected_dtime)
(skb->tstamp_type != BPF_SKB_CLOCK_MONOTONIC &&
skb->tstamp_type != BPF_SKB_CLOCK_TAI) ||
skb->tstamp != EGRESS_ENDHOST_MAGIC)
inc_errs(INGRESS_FWDNS_P101);
else
inc_dtimes(INGRESS_FWDNS_P101);
} else {
if (!fwdns_clear_dtime() && expected_dtime)
if (!fwdns_clear_dtime())
inc_errs(INGRESS_FWDNS_P101);
}
if (skb->tstamp_type == BPF_SKB_TSTAMP_DELIVERY_MONO) {
if (skb->tstamp_type == BPF_SKB_CLOCK_MONOTONIC) {
skb->tstamp = INGRESS_FWDNS_MAGIC;
} else {
if (bpf_skb_set_tstamp(skb, INGRESS_FWDNS_MAGIC,
BPF_SKB_TSTAMP_DELIVERY_MONO))
inc_errs(SET_DTIME);
if (!bpf_skb_set_tstamp(skb, INGRESS_FWDNS_MAGIC,
BPF_SKB_TSTAMP_UNSPEC))
BPF_SKB_CLOCK_MONOTONIC))
inc_errs(SET_DTIME);
}
@ -370,7 +368,7 @@ int egress_fwdns_prio101(struct __sk_buff *skb)
if (skb->tstamp_type) {
if (fwdns_clear_dtime() ||
skb->tstamp_type != BPF_SKB_TSTAMP_DELIVERY_MONO ||
skb->tstamp_type != BPF_SKB_CLOCK_MONOTONIC ||
skb->tstamp != INGRESS_FWDNS_MAGIC)
inc_errs(EGRESS_FWDNS_P101);
else
@ -380,14 +378,11 @@ int egress_fwdns_prio101(struct __sk_buff *skb)
inc_errs(EGRESS_FWDNS_P101);
}
if (skb->tstamp_type == BPF_SKB_TSTAMP_DELIVERY_MONO) {
if (skb->tstamp_type == BPF_SKB_CLOCK_MONOTONIC) {
skb->tstamp = EGRESS_FWDNS_MAGIC;
} else {
if (bpf_skb_set_tstamp(skb, EGRESS_FWDNS_MAGIC,
BPF_SKB_TSTAMP_DELIVERY_MONO))
inc_errs(SET_DTIME);
if (!bpf_skb_set_tstamp(skb, INGRESS_FWDNS_MAGIC,
BPF_SKB_TSTAMP_UNSPEC))
BPF_SKB_CLOCK_MONOTONIC))
inc_errs(SET_DTIME);
}

View File

@ -63,7 +63,7 @@ int passed;
int failed;
int map_fd[9];
struct bpf_map *maps[9];
int prog_fd[11];
int prog_fd[9];
int txmsg_pass;
int txmsg_redir;
@ -1793,8 +1793,6 @@ int prog_attach_type[] = {
BPF_SK_MSG_VERDICT,
BPF_SK_MSG_VERDICT,
BPF_SK_MSG_VERDICT,
BPF_SK_MSG_VERDICT,
BPF_SK_MSG_VERDICT,
};
int prog_type[] = {
@ -1807,8 +1805,6 @@ int prog_type[] = {
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_SK_MSG,
};
static int populate_progs(char *bpf_file)