Networking changes for 6.3.

Core
 ----
 
  - Add dedicated kmem_cache for typical/small skb->head, avoid having
    to access struct page at kfree time, and improve memory use.
 
  - Introduce sysctl to set default RPS configuration for new netdevs.
 
  - Define Netlink protocol specification format which can be used
    to describe messages used by each family and auto-generate parsers.
    Add tools for generating kernel data structures and uAPI headers.
 
  - Expose all net/core sysctls inside netns.
 
  - Remove 4s sleep in netpoll if carrier is instantly detected on boot.
 
  - Add configurable limit of MDB entries per port, and port-vlan.
 
  - Continue populating drop reasons throughout the stack.
 
  - Retire a handful of legacy Qdiscs and classifiers.
 
 Protocols
 ---------
 
  - Support IPv4 big TCP (TSO frames larger than 64kB).
 
  - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
    on socket by socket basis.
 
  - Track and report in procfs number of MPTCP sockets used.
 
  - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
    path manager.
 
  - IPv6: don't check net.ipv6.route.max_size and rely on garbage
    collection to free memory (similarly to IPv4).
 
  - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
 
  - ICMP: add per-rate limit counters.
 
  - Add support for user scanning requests in ieee802154.
 
  - Remove static WEP support.
 
  - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
    reporting.
 
  - WiFi 7 EHT channel puncturing support (client & AP).
 
 BPF
 ---
 
  - Add a rbtree data structure following the "next-gen data structure"
    precedent set by recently added linked list, that is, by using
    kfunc + kptr instead of adding a new BPF map type.
 
  - Expose XDP hints via kfuncs with initial support for RX hash and
    timestamp metadata.
 
  - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
    to better support decap on GRE tunnel devices not operating
    in collect metadata.
 
  - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
 
  - Remove the need for trace_printk_lock for bpf_trace_printk
    and bpf_trace_vprintk helpers.
 
  - Extend libbpf's bpf_tracing.h support for tracing arguments of
    kprobes/uprobes and syscall as a special case.
 
  - Significantly reduce the search time for module symbols
    by livepatch and BPF.
 
  - Enable cpumasks to be used as kptrs, which is useful for tracing
    programs tracking which tasks end up running on which CPUs in
    different time intervals.
 
  - Add support for BPF trampoline on s390x and riscv64.
 
  - Add capability to export the XDP features supported by the NIC.
 
  - Add __bpf_kfunc tag for marking kernel functions as kfuncs.
 
  - Add cgroup.memory=nobpf kernel parameter option to disable BPF
    memory accounting for container environments.
 
 Netfilter
 ---------
 
  - Remove the CLUSTERIP target. It has been marked as obsolete
    for years, and we still have WARN splats wrt. races of
    the out-of-band /proc interface installed by this target.
 
  - Add 'destroy' commands to nf_tables. They are identical to
    the existing 'delete' commands, but do not return an error if
    the referenced object (set, chain, rule...) did not exist.
 
 Driver API
 ----------
 
  - Improve cpumask_local_spread() locality to help NICs set the right
    IRQ affinity on AMD platforms.
 
  - Separate C22 and C45 MDIO bus transactions more clearly.
 
  - Introduce new DCB table to control DSCP rewrite on egress.
 
  - Support configuration of Physical Layer Collision Avoidance (PLCA)
    Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
    shared medium Ethernet.
 
  - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
    preemption of low priority frames by high priority frames.
 
  - Add support for controlling MACSec offload using netlink SET.
 
  - Rework devlink instance refcounts to allow registration and
    de-registration under the instance lock. Split the code into multiple
    files, drop some of the unnecessarily granular locks and factor out
    common parts of netlink operation handling.
 
  - Add TX frame aggregation parameters (for USB drivers).
 
  - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
    messages with notifications for debug.
 
  - Allow offloading of UDP NEW connections via act_ct.
 
  - Add support for per action HW stats in TC.
 
  - Support hardware miss to TC action (continue processing in SW from
    a specific point in the action chain).
 
  - Warn if old Wireless Extension user space interface is used with
    modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
    for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
    interface instead.
 
  - Improve the CAN bit timing configuration. Use extack to return error
    messages directly to user space, update the SJW handling, including
    the definition of a new default value that will benefit CAN-FD
    controllers, by increasing their oscillator tolerance.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - nVidia BlueField-3 support (control traffic driver)
    - Ethernet support for imx93 SoCs
    - Motorcomm yt8531 gigabit Ethernet PHY
    - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
    - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
    - Amlogic gxl MDIO mux
 
  - WiFi:
    - RealTek RTL8188EU (rtl8xxxu)
    - Qualcomm Wi-Fi 7 devices (ath12k)
 
  - CAN:
    - Renesas R-Car V4H
 
 Drivers
 -------
 
  - Bluetooth:
    - Set Per Platform Antenna Gain (PPAG) for Intel controllers.
 
  - Ethernet NICs:
    - Intel (1G, igc):
      - support TSN / Qbv / packet scheduling features of i226 model
    - Intel (100G, ice):
      - use GNSS subsystem instead of TTY
      - multi-buffer XDP support
      - extend support for GPIO pins to E823 devices
    - nVidia/Mellanox:
      - update the shared buffer configuration on PFC commands
      - implement PTP adjphase function for HW offset control
      - TC support for Geneve and GRE with VF tunnel offload
      - more efficient crypto key management method
      - multi-port eswitch support
    - Netronome/Corigine:
      - add DCB IEEE support
      - support IPsec offloading for NFP3800
    - Freescale/NXP (enetc):
      - enetc: support XDP_REDIRECT for XDP non-linear buffers
      - enetc: improve reconfig, avoid link flap and waiting for idle
      - enetc: support MAC Merge layer
    - Other NICs:
      - sfc/ef100: add basic devlink support for ef100
      - ionic: rx_push mode operation (writing descriptors via MMIO)
      - bnxt: use the auxiliary bus abstraction for RDMA
      - r8169: disable ASPM and reset bus in case of tx timeout
      - cpsw: support QSGMII mode for J721e CPSW9G
      - cpts: support pulse-per-second output
      - ngbe: add an mdio bus driver
      - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
      - r8152: handle devices with FW with NCM support
      - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
      - virtio-net: support multi buffer XDP
      - virtio/vsock: replace virtio_vsock_pkt with sk_buff
      - tsnep: XDP support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add support for latency TLV (in FW control messages)
    - Microchip (sparx5):
      - separate explicit and implicit traffic forwarding rules, make
        the implicit rules always active
      - add support for egress DSCP rewrite
      - IS0 VCAP support (Ingress Classification)
      - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
      - ES2 VCAP support (Egress Access Control)
      - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - add MAB (port auth) offload support
      - enable PTP receive for mv88e6390
    - NXP (ocelot):
      - support MAC Merge layer
      - support for the vsc7512 internal copper phys
    - Microchip:
      - lan9303: convert to PHYLINK
      - lan966x: support TC flower filter statistics
      - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
      - lan937x: support Credit Based Shaper configuration
      - ksz9477: support Energy Efficient Ethernet
    - other:
      - qca8k: convert to regmap read/write API, use bulk operations
      - rswitch: Improve TX timestamp accuracy
 
  - Intel WiFi (iwlwifi):
    - EHT (Wi-Fi 7) rate reporting
    - STEP equalizer support: transfer some STEP (connection to radio
      on platforms with integrated wifi) related parameters from the
      BIOS to the firmware.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - IPQ5018 support
    - Fine Timing Measurement (FTM) responder role support
    - channel 177 support
 
  - MediaTek WiFi (mt76):
    - per-PHY LED support
    - mt7996: EHT (Wi-Fi 7) support
    - Wireless Ethernet Dispatch (WED) reset support
    - switch to using page pool allocator
 
  - RealTek WiFi (rtw89):
    - support new version of Bluetooth coexistence
 
  - Mobile:
    - rmnet: support TX aggregation.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Add dedicated kmem_cache for typical/small skb->head, avoid having
     to access struct page at kfree time, and improve memory use.

   - Introduce sysctl to set default RPS configuration for new netdevs.

   - Define Netlink protocol specification format which can be used to
     describe messages used by each family and auto-generate parsers.
     Add tools for generating kernel data structures and uAPI headers.

   - Expose all net/core sysctls inside netns.

   - Remove 4s sleep in netpoll if carrier is instantly detected on
     boot.

   - Add configurable limit of MDB entries per port, and port-vlan.

   - Continue populating drop reasons throughout the stack.

   - Retire a handful of legacy Qdiscs and classifiers.

  Protocols:

   - Support IPv4 big TCP (TSO frames larger than 64kB).

   - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
     on socket by socket basis.

   - Track and report in procfs number of MPTCP sockets used.

   - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
     manager.

   - IPv6: don't check net.ipv6.route.max_size and rely on garbage
     collection to free memory (similarly to IPv4).

   - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

   - ICMP: add per-rate limit counters.

   - Add support for user scanning requests in ieee802154.

   - Remove static WEP support.

   - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
     reporting.

   - WiFi 7 EHT channel puncturing support (client & AP).

  BPF:

   - Add a rbtree data structure following the "next-gen data structure"
     precedent set by recently added linked list, that is, by using
     kfunc + kptr instead of adding a new BPF map type.

   - Expose XDP hints via kfuncs with initial support for RX hash and
     timestamp metadata.

   - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
     better support decap on GRE tunnel devices not operating in collect
     metadata.

   - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

   - Remove the need for trace_printk_lock for bpf_trace_printk and
     bpf_trace_vprintk helpers.

   - Extend libbpf's bpf_tracing.h support for tracing arguments of
     kprobes/uprobes and syscall as a special case.

   - Significantly reduce the search time for module symbols by
     livepatch and BPF.

   - Enable cpumasks to be used as kptrs, which is useful for tracing
     programs tracking which tasks end up running on which CPUs in
     different time intervals.

   - Add support for BPF trampoline on s390x and riscv64.

   - Add capability to export the XDP features supported by the NIC.

   - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

   - Add cgroup.memory=nobpf kernel parameter option to disable BPF
     memory accounting for container environments.

  Netfilter:

   - Remove the CLUSTERIP target. It has been marked as obsolete for
     years, and we still have WARN splats wrt races of the out-of-band
     /proc interface installed by this target.

   - Add 'destroy' commands to nf_tables. They are identical to the
     existing 'delete' commands, but do not return an error if the
     referenced object (set, chain, rule...) did not exist.

  Driver API:

   - Improve cpumask_local_spread() locality to help NICs set the right
     IRQ affinity on AMD platforms.

   - Separate C22 and C45 MDIO bus transactions more clearly.

   - Introduce new DCB table to control DSCP rewrite on egress.

   - Support configuration of Physical Layer Collision Avoidance (PLCA)
     Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
     shared medium Ethernet.

   - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
     preemption of low priority frames by high priority frames.

   - Add support for controlling MACSec offload using netlink SET.

   - Rework devlink instance refcounts to allow registration and
     de-registration under the instance lock. Split the code into
     multiple files, drop some of the unnecessarily granular locks and
     factor out common parts of netlink operation handling.

   - Add TX frame aggregation parameters (for USB drivers).

   - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
     messages with notifications for debug.

   - Allow offloading of UDP NEW connections via act_ct.

   - Add support for per action HW stats in TC.

   - Support hardware miss to TC action (continue processing in SW from
     a specific point in the action chain).

   - Warn if old Wireless Extension user space interface is used with
     modern cfg80211/mac80211 drivers. Do not support Wireless
     Extensions for Wi-Fi 7 devices at all. Everyone should switch to
     using nl80211 interface instead.

   - Improve the CAN bit timing configuration. Use extack to return
     error messages directly to user space, update the SJW handling,
     including the definition of a new default value that will benefit
     CAN-FD controllers, by increasing their oscillator tolerance.

  New hardware / drivers:

   - Ethernet:
      - nVidia BlueField-3 support (control traffic driver)
      - Ethernet support for imx93 SoCs
      - Motorcomm yt8531 gigabit Ethernet PHY
      - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
      - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
      - Amlogic gxl MDIO mux

   - WiFi:
      - RealTek RTL8188EU (rtl8xxxu)
      - Qualcomm Wi-Fi 7 devices (ath12k)

   - CAN:
      - Renesas R-Car V4H

  Drivers:

   - Bluetooth:
      - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

   - Ethernet NICs:
      - Intel (1G, igc):
         - support TSN / Qbv / packet scheduling features of i226 model
      - Intel (100G, ice):
         - use GNSS subsystem instead of TTY
         - multi-buffer XDP support
         - extend support for GPIO pins to E823 devices
      - nVidia/Mellanox:
         - update the shared buffer configuration on PFC commands
         - implement PTP adjphase function for HW offset control
         - TC support for Geneve and GRE with VF tunnel offload
         - more efficient crypto key management method
         - multi-port eswitch support
      - Netronome/Corigine:
         - add DCB IEEE support
         - support IPsec offloading for NFP3800
      - Freescale/NXP (enetc):
         - support XDP_REDIRECT for XDP non-linear buffers
         - improve reconfig, avoid link flap and waiting for idle
         - support MAC Merge layer
      - Other NICs:
         - sfc/ef100: add basic devlink support for ef100
         - ionic: rx_push mode operation (writing descriptors via MMIO)
         - bnxt: use the auxiliary bus abstraction for RDMA
         - r8169: disable ASPM and reset bus in case of tx timeout
         - cpsw: support QSGMII mode for J721e CPSW9G
         - cpts: support pulse-per-second output
         - ngbe: add an mdio bus driver
         - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
         - r8152: handle devices with FW with NCM support
         - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
         - virtio-net: support multi buffer XDP
         - virtio/vsock: replace virtio_vsock_pkt with sk_buff
         - tsnep: XDP support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add support for latency TLV (in FW control messages)
      - Microchip (sparx5):
         - separate explicit and implicit traffic forwarding rules, make
           the implicit rules always active
         - add support for egress DSCP rewrite
         - IS0 VCAP support (Ingress Classification)
         - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
           etc.)
         - ES2 VCAP support (Egress Access Control)
         - support for Per-Stream Filtering and Policing (802.1Q,
           8.6.5.1)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - add MAB (port auth) offload support
         - enable PTP receive for mv88e6390
      - NXP (ocelot):
         - support MAC Merge layer
         - support for the vsc7512 internal copper phys
      - Microchip:
         - lan9303: convert to PHYLINK
         - lan966x: support TC flower filter statistics
         - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
         - lan937x: support Credit Based Shaper configuration
         - ksz9477: support Energy Efficient Ethernet
      - other:
         - qca8k: convert to regmap read/write API, use bulk operations
         - rswitch: Improve TX timestamp accuracy

   - Intel WiFi (iwlwifi):
      - EHT (Wi-Fi 7) rate reporting
      - STEP equalizer support: transfer some STEP (connection to radio
        on platforms with integrated wifi) related parameters from the
        BIOS to the firmware.

   - Qualcomm 802.11ax WiFi (ath11k):
      - IPQ5018 support
      - Fine Timing Measurement (FTM) responder role support
      - channel 177 support

   - MediaTek WiFi (mt76):
      - per-PHY LED support
      - mt7996: EHT (Wi-Fi 7) support
      - Wireless Ethernet Dispatch (WED) reset support
      - switch to using page pool allocator

   - RealTek WiFi (rtw89):
      - support new version of Bluetooth coexistence

   - Mobile:
      - rmnet: support TX aggregation"

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
Linus Torvalds 2023-02-21 18:24:12 -08:00
commit 5b7c4cabbb
1823 changed files with 156333 additions and 43468 deletions

@@ -0,0 +1,19 @@
What: /sys/class/net/<iface>/peak_usb/can_channel_id
Date: November 2022
KernelVersion: 6.2
Contact: Stephane Grosjean <s.grosjean@peak-system.com>
Description:
PEAK PCAN-USB devices support user-configurable CAN channel
identifiers. Contrary to a USB serial number, these identifiers
are writable and can be set per CAN interface. This means that
if a USB device exports multiple CAN interfaces, each of them
can be assigned a unique channel ID.
This attribute provides read-only access to the currently
configured value of the channel identifier. Depending on the
device type, the identifier has a length of 8 or 32 bit. The
value read from this attribute is always an 8 digit 32 bit
hexadecimal value in big endian format. If the device only
supports an 8 bit identifier, the upper 24 bit of the value are
set to zero.
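
A minimal sketch of how user space might decode the text read from this
attribute. The function name ``parse_can_channel_id`` is hypothetical, and the
tolerance for a trailing newline is an assumption, since the description above
only specifies eight hexadecimal digits with the most significant digit first.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert the hexadecimal text of can_channel_id
 * into a 32-bit value. Returns 0 on success, -EINVAL on malformed input.
 * Assumption: the text may end with a single newline.
 */
static int parse_can_channel_id(const char *text, uint32_t *id)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(text, &end, 16);
	if (end == text || errno || val > UINT32_MAX)
		return -EINVAL;
	if (*end != '\0' && *end != '\n')
		return -EINVAL;

	/* For devices with an 8 bit identifier the upper 24 bits are zero. */
	*id = (uint32_t)val;
	return 0;
}
```

For a device that only supports an 8 bit identifier, the caller can simply use
the low byte of the decoded value.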

@@ -557,6 +557,7 @@
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
nobpf -- Disable BPF memory accounting.
checkreqprot= [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }

@@ -215,6 +215,12 @@ rmem_max
The maximum receive socket buffer size in bytes.
rps_default_mask
----------------
The default RPS CPU mask used on newly created network devices. An empty
mask means RPS disabled by default.
tstamp_allow_data
-----------------
Allow processes to receive tx timestamps looped together with the original

@@ -208,6 +208,10 @@ data structures and compile with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.
New BPF functionality is generally added through the use of kfuncs instead of
new helpers. Kfuncs are not considered part of the stable API, and have their own
lifecycle expectations as described in :ref:`BPF_kfunc_lifecycle_expectations`.
Q: Are tracepoints part of the stable ABI?
------------------------------------------
A: NO. Tracepoints are tied to internal implementation details hence they are
@@ -236,8 +240,8 @@ A: NO. Classic BPF programs are converted into extend BPF instructions.
Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.
A: NO. BPF programs can only call specific functions exposed as BPF helpers or
kfuncs. The set of available functions is defined for every program type.
Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
@@ -263,7 +267,12 @@ Q: New functionality via kernel modules?
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?
A: NO.
A: Yes, through kfuncs and kptrs.
The core BPF functionality such as program types, maps and helpers cannot be
added to by modules. However, modules can expose functionality to BPF programs
by exporting kfuncs (which may return pointers to module-internal data
structures as kptrs).
Q: Directly calling kernel function is an ABI?
----------------------------------------------
@@ -278,7 +287,8 @@ kernel functions have already been used by other kernel tcp
cc (congestion-control) implementations. If any of these kernel
functions has changed, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed. The same goes for the bpf
programs and they have to be adjusted accordingly.
programs and they have to be adjusted accordingly. See
:ref:`BPF_kfunc_lifecycle_expectations` for details.
Q: Attaching to arbitrary kernel functions is an ABI?
-----------------------------------------------------
@@ -340,6 +350,7 @@ compatibility for these features?
A: NO.
Unlike map value types, there are no stability guarantees for this case. The
whole API to work with allocated objects and any support for special fields
inside them is unstable (since it is exposed through kfuncs).
Unlike map value types, the API to work with allocated objects and any support
for special fields inside them is exposed through kfuncs, and thus has the same
lifecycle expectations as the kfuncs themselves. See
:ref:`BPF_kfunc_lifecycle_expectations` for details.

@@ -0,0 +1,393 @@
.. SPDX-License-Identifier: GPL-2.0
.. _cpumasks-header-label:
==================
BPF cpumask kfuncs
==================
1. Introduction
===============
``struct cpumask`` is a bitmap data structure in the kernel whose indices
reflect the CPUs on the system. Commonly, cpumasks are used to track which CPUs
a task is affinitized to, but they can also be used to e.g. track which cores
are associated with a scheduling domain, which cores on a machine are idle,
etc.
BPF provides programs with a set of :ref:`kfuncs-header-label` that can be
used to allocate, mutate, query, and free cpumasks.
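
To make the "bitmap indexed by CPU" idea concrete, here is a small userspace
sketch. It is not the kernel's ``struct cpumask``: the ``toy_`` names and the
fixed limit of 64 CPUs are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
#define TOY_NR_CPUS 64

struct toy_cpumask {
	uint64_t bits;
};

/* Mark @cpu as present in @mask; out-of-range CPUs are ignored. */
static inline void toy_cpumask_set_cpu(unsigned int cpu, struct toy_cpumask *mask)
{
	if (cpu < TOY_NR_CPUS)
		mask->bits |= UINT64_C(1) << cpu;
}

/* Remove @cpu from @mask; out-of-range CPUs are ignored. */
static inline void toy_cpumask_clear_cpu(unsigned int cpu, struct toy_cpumask *mask)
{
	if (cpu < TOY_NR_CPUS)
		mask->bits &= ~(UINT64_C(1) << cpu);
}

/* Return true if @cpu is set in @mask. */
static inline bool toy_cpumask_test_cpu(unsigned int cpu, const struct toy_cpumask *mask)
{
	return cpu < TOY_NR_CPUS && (mask->bits & (UINT64_C(1) << cpu)) != 0;
}

/* Count the CPUs set in @mask by clearing the lowest set bit each round. */
static inline unsigned int toy_cpumask_weight(const struct toy_cpumask *mask)
{
	uint64_t bits = mask->bits;
	unsigned int n = 0;

	while (bits) {
		bits &= bits - 1;
		n++;
	}
	return n;
}
```

A task allowed to run on CPUs 0 and 3 would be represented in this model by a
mask whose ``bits`` field is ``0x9``.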
2. BPF cpumask objects
======================
There are two different types of cpumasks that can be used by BPF programs.
2.1 ``struct bpf_cpumask *``
----------------------------
``struct bpf_cpumask *`` is a cpumask that is allocated by BPF, on behalf of a
BPF program, and whose lifecycle is entirely controlled by BPF. These cpumasks
are RCU-protected, can be mutated, can be used as kptrs, and can be safely cast
to a ``struct cpumask *``.
2.1.1 ``struct bpf_cpumask *`` lifecycle
----------------------------------------
A ``struct bpf_cpumask *`` is allocated, acquired, and released, using the
following functions:
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_create
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_acquire
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_release
For example:
.. code-block:: c
struct cpumask_map_value {
struct bpf_cpumask __kptr_ref * cpumask;
};
struct array_map {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, int);
__type(value, struct cpumask_map_value);
__uint(max_entries, 65536);
} cpumask_map SEC(".maps");
static int cpumask_map_insert(struct bpf_cpumask *mask, u32 pid)
{
struct cpumask_map_value local, *v;
long status;
struct bpf_cpumask *old;
u32 key = pid;
local.cpumask = NULL;
status = bpf_map_update_elem(&cpumask_map, &key, &local, 0);
if (status) {
bpf_cpumask_release(mask);
return status;
}
v = bpf_map_lookup_elem(&cpumask_map, &key);
if (!v) {
bpf_cpumask_release(mask);
return -ENOENT;
}
old = bpf_kptr_xchg(&v->cpumask, mask);
if (old)
bpf_cpumask_release(old);
return 0;
}
/**
* A sample tracepoint showing how a task's cpumask can be queried and
* recorded as a kptr.
*/
SEC("tp_btf/task_newtask")
int BPF_PROG(record_task_cpumask, struct task_struct *task, u64 clone_flags)
{
struct bpf_cpumask *cpumask;
int ret;
cpumask = bpf_cpumask_create();
if (!cpumask)
return -ENOMEM;
if (!bpf_cpumask_full(task->cpus_ptr))
bpf_printk("task %s has CPU affinity", task->comm);
bpf_cpumask_copy(cpumask, task->cpus_ptr);
return cpumask_map_insert(cpumask, task->pid);
}
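
The create/acquire/release lifecycle above follows the usual reference-counting
pattern. The sketch below models that pattern in plain userspace C, assuming a
simple atomic counter. The ``toy_`` names are hypothetical, and the kernel's
actual implementation (including its RCU protection) is not reproduced here.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a BPF-allocated cpumask. */
struct toy_obj {
	atomic_int usage;	/* number of outstanding references */
	unsigned long bits;	/* payload */
};

/* Demo-only bookkeeping (not thread-safe): objects not yet freed. */
static int toy_live_objects;

/* Allocate an object; the caller owns the single initial reference. */
static struct toy_obj *toy_create(void)
{
	struct toy_obj *obj = calloc(1, sizeof(*obj));

	if (!obj)
		return NULL;
	atomic_init(&obj->usage, 1);
	toy_live_objects++;
	return obj;
}

/* Take an additional reference on an object the caller already holds. */
static struct toy_obj *toy_acquire(struct toy_obj *obj)
{
	atomic_fetch_add(&obj->usage, 1);
	return obj;
}

/* Drop one reference; the object is freed when the last one goes away. */
static void toy_release(struct toy_obj *obj)
{
	if (atomic_fetch_sub(&obj->usage, 1) == 1) {
		toy_live_objects--;
		free(obj);
	}
}
```

Every successful create or acquire must be paired with exactly one release,
which is why the example above releases ``mask`` on each error path.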
----
2.1.2 ``struct bpf_cpumask *`` as kptrs
---------------------------------------
As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can
also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in
a map, the reference can be removed from the map with bpf_kptr_xchg(), or
opportunistically acquired with bpf_cpumask_kptr_get():
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_kptr_get
Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
.. code-block:: c
/* struct containing the struct bpf_cpumask kptr which is stored in the map. */
struct cpumasks_kfunc_map_value {
struct bpf_cpumask __kptr_ref * bpf_cpumask;
};
/* The map containing struct cpumasks_kfunc_map_value entries. */
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__type(key, int);
__type(value, struct cpumasks_kfunc_map_value);
__uint(max_entries, 1);
} cpumasks_kfunc_map SEC(".maps");
/* ... */
/**
* A simple example tracepoint program showing how a
* struct bpf_cpumask * kptr that is stored in a map can
* be acquired using the bpf_cpumask_kptr_get() kfunc.
*/
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
{
struct bpf_cpumask *kptr;
struct cpumasks_kfunc_map_value *v;
u32 key = 0;
/* Assume a bpf_cpumask * kptr was previously stored in the map. */
v = bpf_map_lookup_elem(&cpumasks_kfunc_map, &key);
if (!v)
return -ENOENT;
/* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */
kptr = bpf_cpumask_kptr_get(&v->bpf_cpumask);
if (!kptr)
/* If no bpf_cpumask was present in the map, it's because
* we're racing with another CPU that removed it with
* bpf_kptr_xchg() between the bpf_map_lookup_elem()
* above, and our call to bpf_cpumask_kptr_get().
* bpf_cpumask_kptr_get() internally safely handles this
* race, and will return NULL if the cpumask is no longer
* present in the map by the time we invoke the kfunc.
*/
return -EBUSY;
/* Free the reference we just took above. Note that the
* original struct bpf_cpumask * kptr is still in the map. It will
* be freed either at a later time if another context deletes
* it from the map, or automatically by the BPF subsystem if
* it's still present when the map is destroyed.
*/
bpf_cpumask_release(kptr);
return 0;
}
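
The ownership transfer performed by bpf_kptr_xchg() can be pictured as an
atomic pointer swap: whoever gets the old pointer back becomes responsible for
releasing it. The following userspace sketch shows only that idea with C11
atomics. The ``toy_`` names are hypothetical, and it does not model
bpf_cpumask_kptr_get() or any verifier checks.

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_mask {
	unsigned long bits;
};

/* Hypothetical map value holding one owned pointer. */
struct toy_map_value {
	_Atomic(struct toy_mask *) slot;
};

/*
 * Atomically store @new in the slot and return the previous occupant,
 * which may be NULL. The caller now owns the returned pointer, and the
 * slot owns @new.
 */
static struct toy_mask *toy_kptr_xchg(struct toy_map_value *v, struct toy_mask *new)
{
	return atomic_exchange(&v->slot, new);
}
```

Because the swap is a single atomic step, two contexts racing to remove the
same pointer cannot both receive it: one gets the pointer and the other gets
``NULL``.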
----
2.2 ``struct cpumask``
----------------------
``struct cpumask`` is the object that actually contains the cpumask bitmap
being queried, mutated, etc. A ``struct bpf_cpumask`` wraps a ``struct
cpumask``, which is why it's safe to cast it as such (note however that it is
**not** safe to cast a ``struct cpumask *`` to a ``struct bpf_cpumask *``, and
the verifier will reject any program that tries to do so).
As we'll see below, any kfunc that mutates its cpumask argument will take a
``struct bpf_cpumask *`` as that argument. Any argument that simply queries the
cpumask will instead take a ``struct cpumask *``.
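
The one-way cast can be illustrated with a simplified userspace model in which
the wrapper embeds the plain mask as its first member. The struct layouts and
the ``toy_`` names below are assumptions for illustration only. The example
programs in this document call a ``cast()`` helper whose definition is not
shown here, and the sketch assumes it does something along these lines.

```c
#include <stddef.h>

/* Stand-in for the plain, read-only view of a mask. */
struct toy_cpumask {
	unsigned long bits;
};

/* Stand-in for the BPF-owned wrapper; layout is hypothetical. */
struct toy_bpf_cpumask {
	struct toy_cpumask cpumask;	/* must remain the first member */
	int usage;			/* extra state the plain mask lacks */
};

/* The wrapper and its embedded mask share the same address. */
_Static_assert(offsetof(struct toy_bpf_cpumask, cpumask) == 0,
	       "embedded cpumask must be the first member");

/* Safe direction: a wrapper can always be viewed as a plain mask. */
static inline const struct toy_cpumask *toy_cast(const struct toy_bpf_cpumask *mask)
{
	return &mask->cpumask;
}
```

The reverse direction has no safe equivalent: an arbitrary
``struct toy_cpumask`` is not guaranteed to be followed by the wrapper's extra
state, which mirrors why the verifier rejects such casts.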
3. cpumask kfuncs
=================
Above, we described the kfuncs that can be used to allocate, acquire, release,
etc a ``struct bpf_cpumask *``. This section of the document will describe the
kfuncs for mutating and querying cpumasks.
3.1 Mutating cpumasks
---------------------
Some cpumask kfuncs are "read-only" in that they don't mutate any of their
arguments, whereas others mutate at least one argument (which means that the
argument must be a ``struct bpf_cpumask *``, as described above).
This section will describe all of the cpumask kfuncs which mutate at least one
argument. :ref:`cpumasks-querying-label` below describes the read-only kfuncs.
3.1.1 Setting and clearing CPUs
-------------------------------
bpf_cpumask_set_cpu() and bpf_cpumask_clear_cpu() can be used to set and clear
a CPU in a ``struct bpf_cpumask`` respectively:
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_set_cpu bpf_cpumask_clear_cpu
These kfuncs are pretty straightforward, and can be used, for example, as
follows:
.. code-block:: c
/**
* A sample tracepoint showing how CPUs in a cpumask can be set, cleared, and tested.
*/
SEC("tp_btf/task_newtask")
int BPF_PROG(test_set_clear_cpu, struct task_struct *task, u64 clone_flags)
{
struct bpf_cpumask *cpumask;
cpumask = bpf_cpumask_create();
if (!cpumask)
return -ENOMEM;
bpf_cpumask_set_cpu(0, cpumask);
if (!bpf_cpumask_test_cpu(0, cast(cpumask)))
/* Should never happen. */
goto release_exit;
bpf_cpumask_clear_cpu(0, cpumask);
if (bpf_cpumask_test_cpu(0, cast(cpumask)))
/* Should never happen. */
goto release_exit;
/* struct cpumask * pointers such as task->cpus_ptr can also be queried. */
if (bpf_cpumask_test_cpu(0, task->cpus_ptr))
bpf_printk("task %s can use CPU %d", task->comm, 0);
release_exit:
bpf_cpumask_release(cpumask);
return 0;
}
----
bpf_cpumask_test_and_set_cpu() and bpf_cpumask_test_and_clear_cpu() are
complementary kfuncs that allow callers to atomically test and set (or clear)
CPUs:
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_test_and_set_cpu bpf_cpumask_test_and_clear_cpu
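As a rough userspace model of the test-and-set and test-and-clear semantics, the sketch below uses C11 atomics on a single 64-bit word. The ``toy_`` names and the 64-CPU limit are assumptions made for illustration; this is not how the kfuncs are implemented.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy cpumask: one atomic 64-bit word (assumption: at most 64 CPUs). */
struct toy_cpumask {
	_Atomic uint64_t bits;
};

/*
 * Atomically set @cpu and report whether it was already set.
 * Callers must pass cpu < 64.
 */
static inline bool toy_test_and_set_cpu(unsigned int cpu,
					struct toy_cpumask *mask)
{
	uint64_t bit = UINT64_C(1) << cpu;

	/* atomic_fetch_or() returns the value held before the update. */
	return (atomic_fetch_or(&mask->bits, bit) & bit) != 0;
}

/*
 * Atomically clear @cpu and report whether it was set beforehand.
 * Callers must pass cpu < 64.
 */
static inline bool toy_test_and_clear_cpu(unsigned int cpu,
					  struct toy_cpumask *mask)
{
	uint64_t bit = UINT64_C(1) << cpu;

	return (atomic_fetch_and(&mask->bits, ~bit) & bit) != 0;
}
```

The point of the combined operation is that the test and the update happen as one step, so no other CPU can change the bit between them.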
----
We can also set and clear entire ``struct bpf_cpumask *`` objects in one
operation using bpf_cpumask_setall() and bpf_cpumask_clear():
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_setall bpf_cpumask_clear
3.1.2 Operations between cpumasks
---------------------------------
In addition to setting and clearing individual CPUs in a single cpumask,
callers can also perform bitwise operations between multiple cpumasks using
bpf_cpumask_and(), bpf_cpumask_or(), and bpf_cpumask_xor():
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_and bpf_cpumask_or bpf_cpumask_xor
The following is an example of how they may be used. Note that some of the
kfuncs shown in this example will be covered in more detail below.
.. code-block:: c
/**
* A sample tracepoint showing how a cpumask can be mutated using
* bitwise operators (and queried).
*/
SEC("tp_btf/task_newtask")
int BPF_PROG(test_and_or_xor, struct task_struct *task, u64 clone_flags)
{
struct bpf_cpumask *mask1, *mask2, *dst1, *dst2;
mask1 = bpf_cpumask_create();
if (!mask1)
return -ENOMEM;
mask2 = bpf_cpumask_create();
if (!mask2) {
bpf_cpumask_release(mask1);
return -ENOMEM;
}
/* ...Safely create the other two masks... */
bpf_cpumask_set_cpu(0, mask1);
bpf_cpumask_set_cpu(1, mask2);
bpf_cpumask_and(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
if (!bpf_cpumask_empty((const struct cpumask *)dst1))
/* Should never happen. */
goto release_exit;
bpf_cpumask_or(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)dst1))
/* Should never happen. */
goto release_exit;
if (!bpf_cpumask_test_cpu(1, (const struct cpumask *)dst1))
/* Should never happen. */
goto release_exit;
bpf_cpumask_xor(dst2, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
if (!bpf_cpumask_equal((const struct cpumask *)dst1,
(const struct cpumask *)dst2))
/* Should never happen. */
goto release_exit;
release_exit:
bpf_cpumask_release(mask1);
bpf_cpumask_release(mask2);
bpf_cpumask_release(dst1);
bpf_cpumask_release(dst2);
return 0;
}
----
The contents of an entire cpumask may be copied to another using
bpf_cpumask_copy():
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_copy
----
.. _cpumasks-querying-label:
3.2 Querying cpumasks
---------------------
In addition to the above kfuncs, there is also a set of read-only kfuncs that
can be used to query the contents of cpumasks.
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset
bpf_cpumask_empty bpf_cpumask_full
.. kernel-doc:: kernel/bpf/cpumask.c
:identifiers: bpf_cpumask_any bpf_cpumask_any_and
----
Some example usages of these querying kfuncs were shown above. We will not
replicate those examples here. Note, however, that all of the aforementioned
kfuncs are tested in `tools/testing/selftests/bpf/progs/cpumask_success.c`_, so
please take a look there if you're looking for more examples of how they can be
used.
.. _tools/testing/selftests/bpf/progs/cpumask_success.c:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c
4. Adding BPF cpumask kfuncs
============================
The set of supported BPF cpumask kfuncs is not (yet) a 1-1 match with the
cpumask operations in include/linux/cpumask.h. Any of those cpumask operations
could easily be encapsulated in a new kfunc if and when required. If you'd like
to support a new cpumask operation, please feel free to submit a patch. If you
do add a new cpumask kfunc, please document it here, and add any relevant
selftest testcases to the cpumask selftest suite.


@ -0,0 +1,267 @@
=========================
BPF Graph Data Structures
=========================
This document describes implementation details of new-style "graph" data
structures (linked_list, rbtree), with particular focus on the verifier's
implementation of semantics specific to those data structures.
Although no specific verifier code is referred to in this document, the document
assumes that the reader has general knowledge of BPF verifier internals, BPF
maps, and BPF program writing.
Note that the intent of this document is to describe the current state of
these graph data structures. **No guarantees** of stability for either
semantics or APIs are made or implied here.
.. contents::
:local:
:depth: 2
Introduction
------------
The BPF map API has historically been the main way to expose data structures
of various types for use within BPF programs. Some data structures fit naturally
with the map API (HASH, ARRAY), others less so. Consequently, programs
interacting with the latter group of data structures can be hard to parse
for kernel programmers without previous BPF experience.
Luckily, some restrictions which necessitated the use of BPF map semantics are
no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
BPF allocator, it is now possible to implement BPF data structures whose API
and semantics more closely match those exposed to the rest of the kernel.
Two such data structures - linked_list and rbtree - have many verification
details in common. Because both have "root"s ("head" for linked_list) and
"node"s, the verifier code and this document refer to common functionality
as "graph_api", "graph_root", "graph_node", etc.
Unless otherwise stated, examples and semantics below apply to both graph data
structures.
Unstable API
------------
Data structures implemented using the BPF map API have historically used BPF
helper functions - either standard map API helpers like ``bpf_map_update_elem``
or map-specific helpers. The new-style graph data structures instead use kfuncs
to define their manipulation helpers. Because there are no stability guarantees
for kfuncs, the API and semantics for these data structures can be evolved in
a way that breaks backwards compatibility if necessary.
Root and node types for the new data structures are opaquely defined in the
``uapi/linux/bpf.h`` header.
Locking
-------
The new-style data structures are intrusive and are defined similarly to their
vanilla kernel counterparts:
.. code-block:: c
struct node_data {
long key;
long data;
struct bpf_rb_node node;
};
struct bpf_spin_lock glock;
struct bpf_rb_root groot __contains(node_data, node);
The "root" type for both linked_list and rbtree expects to be in a map_value
which also contains a ``bpf_spin_lock`` - in the above example both global
variables are placed in a single-value arraymap. The verifier considers this
spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
the same map_value and will enforce that the correct lock is held when
verifying BPF programs that manipulate the tree. Since this lock checking
happens at verification time, there is no runtime penalty.
Non-owning references
---------------------
**Motivation**
Consider the following BPF code:
.. code-block:: c
struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
bpf_spin_lock(&lock);
bpf_rbtree_add(&tree, n); /* PASSED */
bpf_spin_unlock(&lock);
From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
program has ownership of the pointee's (object pointed to by ``n``) lifetime.
The BPF program must pass off ownership before exiting - either via
``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
``bpf_rbtree_add``.
(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
"ownership is acquired" and "ownership is passed", respectively)
What should the verifier do with ``n`` after ownership is passed off? If the
object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
the object is no longer valid. The underlying memory may have been reused for
some other allocation, unmapped, etc.
When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
but that would result in programs with useful, common coding patterns being
rejected, e.g.:
.. code-block:: c
int x;
struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
bpf_spin_lock(&lock);
bpf_rbtree_add(&tree, n); /* PASSED */
x = n->data;
n->data = 42;
bpf_spin_unlock(&lock);
Both the read from and write to ``n->data`` would be rejected. The verifier
can do better, though, by taking advantage of two details:
* Graph data structure APIs can only be used when the ``bpf_spin_lock``
associated with the graph root is held
* Both graph data structures have pointer stability
* Because graph nodes are allocated with ``bpf_obj_new`` and
adding / removing from the root involves fiddling with the
``bpf_{list,rb}_node`` field of the node struct, a graph node will
remain at the same address after either operation.
Because the associated ``bpf_spin_lock`` must be held by any program adding
or removing, if we're in the critical section bounded by that lock, we know
that no other program can add or remove until the end of the critical section.
This combined with pointer stability means that, until the critical section
ends, we can safely access the graph node through ``n`` even after it was used
to pass ownership.
The verifier considers such a reference a *non-owning reference*. The ref
returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
Both terms currently only have meaning in the context of graph nodes and API.
**Details**
Let's enumerate the properties of both types of references.
*owning reference*
* This reference controls the lifetime of the pointee
* Ownership of pointee must be 'released' by passing it to some graph API
kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
* If not released before program ends, verifier considers program invalid
* Access to the pointee's memory will not page fault
*non-owning reference*
* This reference does not own the pointee
* It cannot be used to add the graph node to a graph root, nor ``free``'d via
``bpf_obj_drop``
* No explicit control of lifetime, but can infer valid lifetime based on
non-owning ref existence (see explanation below)
* Access to the pointee's memory will not page fault
From the verifier's perspective, non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock another program
can do arbitrary operations on the data structure, such as removing a node and
``free``-ing it via ``bpf_obj_drop``. A non-owning ref to some chunk of memory
that was removed, ``free``'d, and reused via ``bpf_obj_new`` would point to an
entirely different thing. Or the memory could go away.
To prevent this logic violation all non-owning references are invalidated by the
verifier after a critical section ends. This is necessary to ensure the "will
not page fault" property of non-owning references. So if the verifier hasn't
invalidated a non-owning ref, accessing it will not page fault.
Currently ``bpf_obj_drop`` is not allowed in the critical section, so
if there's a valid non-owning ref, we must be in a critical section, and can
conclude that the ref's memory hasn't been dropped-and- ``free``'d or
dropped-and-reused.
Any reference to a node that is in an rbtree *must* be non-owning, since
the tree has control of the pointee's lifetime. Similarly, any ref to a node
that isn't in rbtree *must* be owning. This results in a nice property:
graph API add / remove implementations don't need to check if a node
has already been added (or already removed), as the ownership model
allows the verifier to prevent such a state from being valid by simply checking
types.
However, pointer aliasing poses an issue for the above "nice property".
Consider the following example:
.. code-block:: c
struct node_data *n, *m, *o, *p;
n = bpf_obj_new(typeof(*n)); /* 1 */
bpf_spin_lock(&lock);
bpf_rbtree_add(&tree, n); /* 2 */
m = bpf_rbtree_first(&tree); /* 3 */
o = bpf_rbtree_remove(&tree, n); /* 4 */
p = bpf_rbtree_remove(&tree, m); /* 5 */
bpf_spin_unlock(&lock);
bpf_obj_drop(o);
bpf_obj_drop(p); /* 6 */
Assume the tree is empty before this program runs. If we track verifier state
changes here using numbers in above comments:
1) n is an owning reference
2) n is a non-owning reference, it's been added to the tree
3) n and m are non-owning references, they both point to the same node
4) o is an owning reference, n and m non-owning, all point to same node
5) o and p are owning, n and m non-owning, all point to the same node
6) a double-free has occurred, since o and p point to same node and o was
``free``'d in previous statement
States 4 and 5 violate our "nice property", as there are non-owning refs to
a node which is not in an rbtree. Statement 5 will try to remove a node which
has already been removed as a result of this violation. State 6 is a dangerous
double-free.
At a minimum we should prevent state 6 from being possible. If we can't also
prevent state 5 then we must abandon our "nice property" and check whether a
node has already been removed at runtime.
We prevent both by generalizing the "invalidate non-owning references" behavior
of ``bpf_spin_unlock`` and doing similar invalidation after
``bpf_rbtree_remove``. The logic here is that any graph API kfunc which:
* takes an arbitrary node argument
* removes it from the data structure
* returns an owning reference to the removed node
may result in a state where some other non-owning reference points to the same
node. So ``remove``-type kfuncs must be considered a non-owning reference
invalidation point as well.


@ -20,6 +20,7 @@ that goes into great technical depth about the BPF Architecture.
syscall_api
helpers
kfuncs
cpumasks
programs
maps
bpf_prog_run


@ -7,6 +7,11 @@ eBPF Instruction Set Specification, v1.0
This document specifies version 1.0 of the eBPF instruction set.
Documentation conventions
=========================
For brevity, this document uses the type notation "u64", "u32", etc.
to mean an unsigned integer whose width is the specified number of bits.
Registers and calling convention
================================
@ -30,20 +35,56 @@ Instruction encoding
eBPF has two instruction encodings:
* the basic instruction encoding, which uses 64 bits to encode an instruction
* the wide instruction encoding, which appends a second 64-bit immediate value
(imm64) after the basic instruction for a total of 128 bits.
* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
constant) value after the basic instruction for a total of 128 bits.
The basic instruction encoding looks as follows:
The basic instruction encoding is as follows, where MSB and LSB mean the most significant
bits and least significant bits, respectively:
============= ======= =============== ==================== ============
============= ======= ======= ======= ============
32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
============= ======= =============== ==================== ============
immediate offset source register destination register opcode
============= ======= =============== ==================== ============
============= ======= ======= ======= ============
imm offset src_reg dst_reg opcode
============= ======= ======= ======= ============
**imm**
signed integer immediate value
**offset**
signed integer offset used with pointer arithmetic
**src_reg**
the source register number (0-10), except where otherwise specified
(`64-bit immediate instructions`_ reuse this field for other purposes)
**dst_reg**
destination register number (0-10)
**opcode**
operation to perform
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
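The field layout in the table can be sketched as a decoder. This assumes the basic instruction is held as a single 64-bit integer with ``imm`` in the most significant bits and ``opcode`` in the least significant bits, exactly as the table is drawn; ``struct insn_fields`` and ``decode_insn()`` are hypothetical names, not kernel definitions.

```c
#include <stdint.h>

/* Decoded fields of one basic (64-bit) instruction. */
struct insn_fields {
	int32_t imm;     /* signed immediate, bits 63..32 */
	int16_t offset;  /* signed offset, bits 31..16 */
	uint8_t src_reg; /* source register number, bits 15..12 */
	uint8_t dst_reg; /* destination register number, bits 11..8 */
	uint8_t opcode;  /* operation to perform, bits 7..0 */
};

/*
 * Split a raw instruction into its fields. The casts to signed types
 * rely on two's complement conversion, which mainstream compilers
 * provide.
 */
static inline struct insn_fields decode_insn(uint64_t raw)
{
	struct insn_fields f;

	f.imm = (int32_t)(uint32_t)(raw >> 32);
	f.offset = (int16_t)(uint16_t)(raw >> 16);
	f.src_reg = (raw >> 12) & 0xf;
	f.dst_reg = (raw >> 8) & 0xf;
	f.opcode = raw & 0xff;
	return f;
}
```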
As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
instruction uses a 64-bit immediate value that is constructed as follows.
The 64 bits following the basic instruction contain a pseudo instruction
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
and imm containing the high 32 bits of the immediate value.
================= ==================
64 bits (MSB) 64 bits (LSB)
================= ==================
basic instruction pseudo instruction
================= ==================
Thus the 64-bit immediate value is constructed as follows:
imm64 = (next_imm << 32) | imm
where 'next_imm' refers to the imm value of the pseudo instruction
following the basic instruction.
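A minimal sketch of the ``imm64`` construction in C follows. The function name ``make_imm64()`` is hypothetical. Because both ``imm`` fields are signed 32-bit values, the sketch converts them to unsigned before combining; without that step a negative low half would sign extend and overwrite the high 32 bits.

```c
#include <stdint.h>

/*
 * Combine the 'imm' field of the basic instruction (low 32 bits) with the
 * 'imm' field of the pseudo instruction that follows it (high 32 bits).
 */
static inline uint64_t make_imm64(int32_t imm, int32_t next_imm)
{
	return ((uint64_t)(uint32_t)next_imm << 32) | (uint32_t)imm;
}
```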
Instruction classes
-------------------
@ -71,27 +112,32 @@ For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` an
============== ====== =================
4 bits (MSB) 1 bit 3 bits (LSB)
============== ====== =================
operation code source instruction class
code source instruction class
============== ====== =================
The 4th bit encodes the source operand:
**code**
the operation code, whose meaning varies by instruction class
====== ===== ========================================
**source**
the source operand location, which unless otherwise specified is one of:
====== ===== ==============================================
source value description
====== ===== ========================================
BPF_K 0x00 use 32-bit immediate as source operand
BPF_X 0x08 use 'src_reg' register as source operand
====== ===== ========================================
The four MSB bits store the operation code.
====== ===== ==============================================
BPF_K 0x00 use 32-bit 'imm' value as source operand
BPF_X 0x08 use 'src_reg' register value as source operand
====== ===== ==============================================
**instruction class**
the instruction class (see `Instruction classes`_)
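The three opcode fields can be separated with simple masks. In the sketch below the ``OP_*`` macros are hypothetical helpers; ``BPF_K``, ``BPF_X``, ``BPF_ADD`` and ``BPF_XOR`` use the values from the tables in this document, while the ``BPF_ALU`` and ``BPF_ALU64`` class values come from the instruction classes table of the specification, which is not shown in this excerpt.

```c
#include <stdint.h>

/* Field masks for ALU/JMP opcodes: 4-bit code, 1-bit source, 3-bit class. */
#define OP_CODE(op)   ((uint8_t)((op) & 0xf0))
#define OP_SOURCE(op) ((uint8_t)((op) & 0x08))
#define OP_CLASS(op)  ((uint8_t)((op) & 0x07))

/* Source operand location. */
#define BPF_K     0x00 /* use 32-bit 'imm' value */
#define BPF_X     0x08 /* use 'src_reg' register value */

/* Two of the arithmetic operation codes. */
#define BPF_ADD   0x00
#define BPF_XOR   0xa0

/* Instruction classes (from the specification's class table). */
#define BPF_ALU   0x04
#define BPF_ALU64 0x07
```

For example, the opcode byte ``0xaf`` decomposes into ``BPF_XOR | BPF_X | BPF_ALU64``.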
Arithmetic instructions
-----------------------
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
The 'code' field encodes the operation as below:
The 'code' field encodes the operation as below, where 'src' and 'dst' refer
to the values of the source and destination registers, respectively.
======== ===== ==========================================================
code value description
@ -99,35 +145,49 @@ code value description
BPF_ADD 0x00 dst += src
BPF_SUB 0x10 dst -= src
BPF_MUL 0x20 dst \*= src
BPF_DIV 0x30 dst /= src
BPF_DIV 0x30 dst = (src != 0) ? (dst / src) : 0
BPF_OR 0x40 dst \|= src
BPF_AND 0x50 dst &= src
BPF_LSH 0x60 dst <<= src
BPF_RSH 0x70 dst >>= src
BPF_NEG 0x80 dst = -dst
BPF_MOD 0x90 dst %= src
BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst
BPF_XOR 0xa0 dst ^= src
BPF_MOV 0xb0 dst = src
BPF_ARSH 0xc0 sign extending shift right
BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
======== ===== ==========================================================
Underflow and overflow are allowed during arithmetic operations, meaning
the 64-bit or 32-bit value will wrap. If eBPF program execution would
result in division by zero, the destination register is instead set to zero.
If execution would result in modulo by zero, for ``BPF_ALU64`` the value of
the destination register is unchanged whereas for ``BPF_ALU`` the upper
32 bits of the destination register are zeroed.
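The division and modulo rules above can be modeled in plain C. This is a sketch of the described semantics only, not interpreter or JIT code, and the function names are hypothetical. Registers are modeled as 64-bit values; the ``BPF_ALU`` variants operate on the low 32 bits and zero the upper 32 bits of the result.

```c
#include <stdint.h>

/* BPF_ALU64 | BPF_DIV: unsigned division; division by zero yields 0. */
static inline uint64_t alu64_div(uint64_t dst, uint64_t src)
{
	return src != 0 ? dst / src : 0;
}

/* BPF_ALU64 | BPF_MOD: modulo by zero leaves dst unchanged. */
static inline uint64_t alu64_mod(uint64_t dst, uint64_t src)
{
	return src != 0 ? dst % src : dst;
}

/* BPF_ALU | BPF_DIV: 32-bit operands, result zero-extended to 64 bits. */
static inline uint64_t alu32_div(uint64_t dst, uint64_t src)
{
	uint32_t d = (uint32_t)dst, s = (uint32_t)src;

	return s != 0 ? d / s : 0;
}

/*
 * BPF_ALU | BPF_MOD: on modulo by zero the low 32 bits of dst are kept
 * and the upper 32 bits are zeroed.
 */
static inline uint64_t alu32_mod(uint64_t dst, uint64_t src)
{
	uint32_t d = (uint32_t)dst, s = (uint32_t)src;

	return s != 0 ? d % s : d;
}
```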
``BPF_ADD | BPF_X | BPF_ALU`` means::
dst_reg = (u32) dst_reg + (u32) src_reg;
dst = (u32) ((u32) dst + (u32) src)
where '(u32)' indicates that the upper 32 bits are zeroed.
``BPF_ADD | BPF_X | BPF_ALU64`` means::
dst_reg = dst_reg + src_reg
dst = dst + src
``BPF_XOR | BPF_K | BPF_ALU`` means::
dst_reg = (u32) dst_reg ^ (u32) imm32
dst = (u32) dst ^ (u32) imm32
``BPF_XOR | BPF_K | BPF_ALU64`` means::
dst_reg = dst_reg ^ imm32
dst = dst ^ imm32
Also note that the division and modulo operations are unsigned. Thus, for
``BPF_ALU``, 'imm' is first interpreted as an unsigned 32-bit value, whereas
for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result
interpreted as an unsigned 64-bit value. There are no instructions for
signed division or modulo.
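The two interpretations of a negative ``imm`` differ only in width, which the following sketch makes concrete. The helper names are hypothetical.

```c
#include <stdint.h>

/* BPF_ALU with BPF_K: 'imm' is reinterpreted as an unsigned 32-bit value. */
static inline uint32_t alu32_imm_operand(int32_t imm)
{
	return (uint32_t)imm;
}

/*
 * BPF_ALU64 with BPF_K: 'imm' is sign extended to 64 bits first, and the
 * result is then interpreted as an unsigned 64-bit value.
 */
static inline uint64_t alu64_imm_operand(int32_t imm)
{
	return (uint64_t)(int64_t)imm;
}
```

With ``imm = -2``, a ``BPF_ALU`` division therefore divides by ``0xfffffffe``, while a ``BPF_ALU64`` division divides by ``0xfffffffffffffffe``.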
Byte swap instructions
~~~~~~~~~~~~~~~~~~~~~~
@ -155,11 +215,11 @@ Examples:
``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
dst_reg = htole16(dst_reg)
dst = htole16(dst)
``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
dst_reg = htobe64(dst_reg)
dst = htobe64(dst)
Jump instructions
-----------------
@ -234,15 +294,15 @@ instructions that transfer data between a register and memory.
``BPF_MEM | <size> | BPF_STX`` means::
*(size *) (dst_reg + off) = src_reg
*(size *) (dst + offset) = src
``BPF_MEM | <size> | BPF_ST`` means::
*(size *) (dst_reg + off) = imm32
*(size *) (dst + offset) = imm32
``BPF_MEM | <size> | BPF_LDX`` means::
dst_reg = *(size *) (src_reg + off)
dst = *(size *) (src + offset)
Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
@ -276,11 +336,11 @@ BPF_XOR 0xa0 atomic xor
``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
*(u32 *)(dst_reg + off16) += src_reg
*(u32 *)(dst + offset) += src
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means::
*(u64 *)(dst_reg + off16) += src_reg
*(u64 *)(dst + offset) += src
In addition to the simple atomic operations, there also is a modifier and
two complex atomic operations:
@ -295,16 +355,16 @@ BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``BPF_FETCH`` flag
is set, then the operation also overwrites ``src_reg`` with the value that
is set, then the operation also overwrites ``src`` with the value that
was in memory before it was modified.
The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
addressed by ``dst_reg + off``.
The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
addressed by ``dst + offset``.
The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
``dst_reg + off`` with ``R0``. If they match, the value addressed by
``dst_reg + off`` is replaced with ``src_reg``. In either case, the
value that was at ``dst_reg + off`` before the operation is zero-extended
``dst + offset`` with ``R0``. If they match, the value addressed by
``dst + offset`` is replaced with ``src``. In either case, the
value that was at ``dst + offset`` before the operation is zero-extended
and loaded back to ``R0``.
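The fetch, exchange, and compare-and-exchange semantics map closely onto C11 atomics, so a userspace model is a convenient way to read them. The ``model_`` function names are hypothetical; each returns the value that the instruction would leave in ``src`` (for the fetch and exchange cases) or in ``R0`` (for compare and exchange).

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * BPF_ADD | BPF_FETCH: add src to memory and return the value that was in
 * memory before the addition (the value src is overwritten with).
 */
static inline uint64_t model_fetch_add(_Atomic uint64_t *addr, uint64_t src)
{
	return atomic_fetch_add(addr, src);
}

/* BPF_XCHG: store src to memory and return the previous memory value. */
static inline uint64_t model_xchg(_Atomic uint64_t *addr, uint64_t src)
{
	return atomic_exchange(addr, src);
}

/*
 * BPF_CMPXCHG: if memory equals r0, replace it with src. In either case
 * the value that was in memory before the operation is returned, which
 * is what gets loaded back into R0.
 */
static inline uint64_t model_cmpxchg(_Atomic uint64_t *addr, uint64_t r0,
				     uint64_t src)
{
	uint64_t expected = r0;

	/* On failure, 'expected' is updated to the current memory value. */
	atomic_compare_exchange_strong(addr, &expected, src);
	return expected;
}
```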
64-bit immediate instructions
@ -317,7 +377,7 @@ There is currently only one such instruction.
``BPF_LD | BPF_DW | BPF_IMM`` means::
dst_reg = imm64
dst = imm64
Legacy BPF Packet access instructions


@ -1,3 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
.. _kfuncs-header-label:
=============================
BPF Kernel Functions (kfuncs)
=============================
@ -9,7 +13,7 @@ BPF Kernel Functions or more commonly known as kfuncs are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
kernel.
kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information.
2. Defining a kfunc
===================
@ -37,7 +41,7 @@ An example is given below::
__diag_ignore_all("-Wmissing-prototypes",
"Global kfuncs as their definitions will be in BTF");
struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
__bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
{
return find_get_task_by_vpid(nr);
}
@ -62,7 +66,7 @@ kfunc with a __tag, where tag may be one of the supported annotations.
This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::
void bpf_memzero(void *mem, int mem__sz)
__bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
{
...
}
@ -82,7 +86,7 @@ safety of the program.
An example is given below::
void *bpf_obj_new(u32 local_type_id__k, ...)
__bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
{
...
}
@ -121,6 +125,20 @@ flags on a set of kfuncs as follows::
This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.
kfunc definitions should also always be annotated with the ``__bpf_kfunc``
macro. This prevents issues such as the compiler inlining the kfunc if it's a
static kernel function, or the function being elided in an LTO build as it's
not used in the rest of the kernel. Developers should not manually add
annotations to their kfunc to prevent these issues. If an annotation is
required to prevent such an issue with your kfunc, it is a bug and should be
added to the definition of the macro so that other kfuncs are similarly
protected. An example is given below::
__bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
{
...
}
2.4.1 KF_ACQUIRE flag
---------------------
@ -163,7 +181,8 @@ KF_ACQUIRE and KF_RET_NULL flags.
The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer).
offset, and without having been obtained from walking another pointer, with one
exception described below).
There are two types of pointers to kernel objects which are considered "valid":
@ -176,6 +195,25 @@ KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.
As mentioned above, a nested pointer obtained from walking a trusted pointer is
no longer trusted, with one exception. If a struct type has a field that is
guaranteed to be valid as long as its parent pointer is trusted, the
``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
follows:
.. code-block:: c
BTF_TYPE_SAFE_NESTED(struct task_struct) {
const cpumask_t *cpus_ptr;
};
In other words, you must:
1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.
2. Specify the type and name of the trusted nested field. This field must match
the field in the original type definition exactly.
2.4.6 KF_SLEEPABLE flag
-----------------------
@ -200,6 +238,28 @@ single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have reference count of 0 and the kfunc must take this
into consideration.
.. _KF_deprecated_flag:
2.4.9 KF_DEPRECATED flag
------------------------
The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
changed or removed in a subsequent kernel release. A kfunc that is
marked with KF_DEPRECATED should also have any relevant information
captured in its kernel doc. Such information typically includes the
kfunc's expected remaining lifespan, a recommendation for new
functionality that can replace it if any is available, and possibly a
rationale for why it is being removed.
Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be
supported and have its KF_DEPRECATED flag removed, it is likely to be far more
difficult to remove a KF_DEPRECATED flag after it's been added than it is to
prevent it from being added in the first place. As described in
:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are
encouraged to make their use-cases known as early as possible, and participate
in upstream discussions regarding whether to keep, change, deprecate, or remove
those kfuncs if and when such discussions occur.
2.5 Registering the kfuncs
--------------------------
@ -223,14 +283,150 @@ type. An example is shown below::
}
late_initcall(init_subsystem);
3. Core kfuncs
2.6 Specifying no-cast aliases with ___init
--------------------------------------------
The verifier will always enforce that the BTF type of a pointer passed to a
kfunc by a BPF program matches the type of pointer specified in the kfunc
definition. The verifier does, however, allow types that are equivalent
according to the C standard to be passed to the same kfunc arg, even if their
BTF_IDs differ.
For example, for the following type definition:
.. code-block:: c
struct bpf_cpumask {
cpumask_t cpumask;
refcount_t usage;
};
The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
taking a ``cpumask_t *`` (``cpumask_t`` is a typedef of ``struct cpumask``). For
instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed
to bpf_cpumask_test_cpu().
In some cases, this type-aliasing behavior is not desired. ``struct
nf_conn___init`` is one such example:
.. code-block:: c
struct nf_conn___init {
struct nf_conn ct;
};
The C standard would consider these types to be equivalent, but it would not
always be safe to pass either type to a trusted kfunc. ``struct
nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
nf_conn *`` (e.g. ``bpf_ct_change_timeout()``).
In order to accommodate such requirements, the verifier will enforce strict
PTR_TO_BTF_ID type matching if two types have the exact same name, with one
being suffixed with ``___init``.
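The naming rule can be illustrated with a small string check. This is only a sketch of the rule as stated above, using hypothetical function names; it is not the verifier's implementation, which works on BTF types rather than bare strings.

```c
#include <stdbool.h>
#include <string.h>

#define INIT_SUFFIX "___init"

/* True if @name is exactly @base followed by the ___init suffix. */
static inline bool is_init_variant_of(const char *name, const char *base)
{
	size_t base_len = strlen(base);

	return strncmp(name, base, base_len) == 0 &&
	       strcmp(name + base_len, INIT_SUFFIX) == 0;
}

/*
 * True if the two type names are identical except that one of them
 * carries the ___init suffix, i.e. the case where strict type matching
 * would apply under the rule described above.
 */
static inline bool needs_strict_match(const char *a, const char *b)
{
	return is_init_variant_of(a, b) || is_init_variant_of(b, a);
}
```

Under this model ``nf_conn`` and ``nf_conn___init`` require strict matching, while ``cpumask`` and ``bpf_cpumask`` do not, so the latter pair keeps the usual aliasing behavior.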
.. _BPF_kfunc_lifecycle_expectations:
3. kfunc lifecycle expectations
===============================
kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the
strict stability restrictions associated with kernel <-> user UAPIs. This means
they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be
modified or removed by a maintainer of the subsystem they're defined in when
it's deemed necessary.
Like any other change to the kernel, maintainers will not change or remove a
kfunc without having a reasonable justification. Whether or not they'll choose
to change a kfunc will ultimately depend on a variety of factors, such as how
widely used the kfunc is, how long the kfunc has been in the kernel, whether an
alternative kfunc exists, what the norm is in terms of stability for the
subsystem in question, and of course what the technical cost is of continuing
to support the kfunc.
There are several implications of this:
a) kfuncs that are widely used or have been in the kernel for a long time will
be more difficult to justify being changed or removed by a maintainer. In
other words, kfuncs that are known to have a lot of users and provide
significant value provide stronger incentives for maintainers to invest the
time and complexity in supporting them. It is therefore important for
developers that are using kfuncs in their BPF programs to communicate and
explain how and why those kfuncs are being used, and to participate in
discussions regarding those kfuncs when they occur upstream.
b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs
that call kfuncs are generally not part of the kernel tree. This means that
refactoring cannot typically change callers in-place when a kfunc changes,
as is done for e.g. an upstreamed driver being updated in place when a
kernel symbol is changed.
Unlike with regular kernel symbols, this is expected behavior for BPF
symbols, and out-of-tree BPF programs that use kfuncs should be considered
relevant to discussions and decisions around modifying and removing those
kfuncs. The BPF community will take an active role in participating in
upstream discussions when necessary to ensure that the perspectives of such
users are taken into account.
c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and
will not ever hard-block a change in the kernel purely for stability
reasons. That being said, kfuncs are features that are meant to solve
problems and provide value to users. The decision of whether to change or
remove a kfunc is a multivariate technical decision that is made on a
case-by-case basis, and which is informed by data points such as those
mentioned above. It is expected that a kfunc being removed or changed with
no warning will not be a common occurrence or take place without sound
justification, but it is a possibility that must be accepted if one is to
use kfuncs.
3.1 kfunc deprecation
---------------------
As described above, while sometimes a maintainer may find that a kfunc must be
changed or removed immediately to accommodate some changes in their subsystem,
usually kfuncs will be able to accommodate a longer and more measured
deprecation process. For example, if a new kfunc comes along which provides
superior functionality to an existing kfunc, the existing kfunc may be
deprecated for some period of time to allow users to migrate their BPF programs
to use the new one. Or, if a kfunc has no known users, a decision may be made
to remove the kfunc (without providing an alternative API) after some
deprecation period so as to provide users with a window to notify the kfunc
maintainer if it turns out that the kfunc is actually being used.
It's expected that the common case will be that kfuncs will go through a
deprecation period rather than being changed or removed without warning. As
described in :ref:`KF_deprecated_flag`, the kfunc framework provides the
KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been
deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following
procedure is followed for removal:
1. Any relevant information for deprecated kfuncs is documented in the kfunc's
kernel docs. This documentation will typically include the kfunc's expected
remaining lifespan, a recommendation for new functionality that can replace
the usage of the deprecated function (or an explanation as to why no such
replacement exists), etc.
2. The deprecated kfunc is kept in the kernel for some period of time after it
was first marked as deprecated. This time period will be chosen on a
case-by-case basis, and will typically depend on how widespread the use of
the kfunc is, how long it has been in the kernel, and how hard it is to move
to alternatives. This deprecation time period is "best effort", and as
described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may
sometimes dictate that the kfunc be removed before the full intended
deprecation period has elapsed.
3. After the deprecation period the kfunc will be removed. At this point, BPF
programs calling the kfunc will be rejected by the verifier.
4. Core kfuncs
==============
The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.
3.1 struct task_struct * kfuncs
4.1 struct task_struct * kfuncs
-------------------------------
There are a number of kfuncs that allow ``struct task_struct *`` objects to be
@ -306,7 +502,7 @@ Here is an example of it being used:
return 0;
}
3.2 struct cgroup * kfuncs
4.2 struct cgroup * kfuncs
--------------------------
``struct cgroup *`` objects also have acquire and release functions:
@ -420,3 +616,10 @@ the verifier. bpf_cgroup_ancestor() can be used as follows:
bpf_cgroup_release(parent);
return 0;
}
4.3 struct cpumask * kfuncs
---------------------------
BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
for more details.


@ -83,8 +83,8 @@ This prevents from accidentally exporting a symbol, that is not supposed
to be a part of ABI what, in turn, improves both libbpf developer- and
user-experiences.
ABI versionning
---------------
ABI versioning
--------------
To make future ABI extensions possible libbpf ABI is versioned.
Versioning is implemented by ``libbpf.map`` version script that is
@ -148,7 +148,7 @@ API documentation convention
The libbpf API is documented via comments above definitions in
header files. These comments can be rendered by doxygen and sphinx
for well organized html output. This section describes the
convention in which these comments should be formated.
convention in which these comments should be formatted.
Here is an example from btf.h:


@ -0,0 +1,498 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright Red Hat
==============================================
BPF_MAP_TYPE_SOCKMAP and BPF_MAP_TYPE_SOCKHASH
==============================================
.. note::
- ``BPF_MAP_TYPE_SOCKMAP`` was introduced in kernel version 4.14
- ``BPF_MAP_TYPE_SOCKHASH`` was introduced in kernel version 4.18
``BPF_MAP_TYPE_SOCKMAP`` and ``BPF_MAP_TYPE_SOCKHASH`` maps can be used to
redirect skbs between sockets or to apply policy at the socket level based on
the result of a BPF (verdict) program with the help of the BPF helpers
``bpf_sk_redirect_map()``, ``bpf_sk_redirect_hash()``,
``bpf_msg_redirect_map()`` and ``bpf_msg_redirect_hash()``.
``BPF_MAP_TYPE_SOCKMAP`` is backed by an array that uses an integer key as the
index to look up a reference to a ``struct sock``. The map values are socket
descriptors. Similarly, ``BPF_MAP_TYPE_SOCKHASH`` is a hash backed BPF map that
holds references to sockets via their socket descriptors.
.. note::
The value type is either __u32 or __u64; the latter (__u64) is to support
returning socket cookies to userspace. Returning the ``struct sock *`` that
the map holds to user-space is neither safe nor useful.
These maps may have BPF programs attached to them, specifically a parser program
and a verdict program. The parser program determines how much data has been
parsed and therefore how much data needs to be queued to come to a verdict. The
verdict program is essentially the redirect program and can return a verdict
of ``__SK_DROP``, ``__SK_PASS``, or ``__SK_REDIRECT``.
When a socket is inserted into one of these maps, its socket callbacks are
replaced and a ``struct sk_psock`` is attached to it. Additionally, this
``sk_psock`` inherits the programs that are attached to the map.
A sock object may be in multiple maps, but can only inherit a single
parser or verdict program. If adding a sock object to a map would result
in having multiple parser programs, the update will return an EBUSY error.
The supported programs to attach to these maps are:
.. code-block:: c
struct sk_psock_progs {
struct bpf_prog *msg_parser;
struct bpf_prog *stream_parser;
struct bpf_prog *stream_verdict;
struct bpf_prog *skb_verdict;
};
.. note::
Users are not allowed to attach ``stream_verdict`` and ``skb_verdict``
programs to the same map.
The attach types for the map programs are:
- ``msg_parser`` program - ``BPF_SK_MSG_VERDICT``.
- ``stream_parser`` program - ``BPF_SK_SKB_STREAM_PARSER``.
- ``stream_verdict`` program - ``BPF_SK_SKB_STREAM_VERDICT``.
- ``skb_verdict`` program - ``BPF_SK_SKB_VERDICT``.
There are additional helpers available to use with the parser and verdict
programs: ``bpf_msg_apply_bytes()`` and ``bpf_msg_cork_bytes()``. With
``bpf_msg_apply_bytes()`` BPF programs can tell the infrastructure how many
bytes the given verdict should apply to. The helper ``bpf_msg_cork_bytes()``
handles a different case where a BPF program cannot reach a verdict on a msg
until it receives more bytes AND the program doesn't want to forward the packet
until it is known to be good.
Finally, the helpers ``bpf_msg_pull_data()`` and ``bpf_msg_push_data()`` are
available to ``BPF_PROG_TYPE_SK_MSG`` BPF programs to pull in data and set the
start and end pointers to given values or to add metadata to the ``struct
sk_msg_buff *msg``.
All these helpers will be described in more detail below.
Usage
=====
Kernel BPF
----------
bpf_msg_redirect_map()
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
This helper is used in programs implementing policies at the socket level. If
the message ``msg`` is allowed to pass (i.e., if the verdict BPF program
returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
to select the ingress path; otherwise the egress path is selected. This is the
only flag supported for now.
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
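The flag handling and return values described above can be modelled in user-space C. This is a toy sketch, not the kernel implementation: ``struct toy_sockmap``, ``struct toy_redirect`` and ``toy_redirect_map()`` are hypothetical names, while the numeric values of ``BPF_F_INGRESS``, ``SK_DROP`` and ``SK_PASS`` are taken from the kernel UAPI.

```c
#include <stdint.h>

#define BPF_F_INGRESS (1ULL << 0)	/* same value as in the UAPI header */
enum { SK_DROP = 0, SK_PASS = 1 };	/* same values as enum sk_action */

#define TOY_MAP_SIZE 4

/* Hypothetical stand-in for a sockmap: slot i holds a socket id, 0 = empty. */
struct toy_sockmap {
	int socks[TOY_MAP_SIZE];
};

/* Hypothetical record of where the data would be redirected. */
struct toy_redirect {
	int target;	/* socket id selected by the key */
	int ingress;	/* 1: ingress path, 0: egress path */
};

static int toy_redirect_map(const struct toy_sockmap *map, uint32_t key,
			    uint64_t flags, struct toy_redirect *out)
{
	/* BPF_F_INGRESS is the only flag accepted. */
	if (flags & ~BPF_F_INGRESS)
		return SK_DROP;
	/* An out-of-range key or an empty slot is an error. */
	if (key >= TOY_MAP_SIZE || map->socks[key] == 0)
		return SK_DROP;

	out->target = map->socks[key];
	out->ingress = (flags & BPF_F_INGRESS) != 0;
	return SK_PASS;
}
```

The model returns a verdict rather than an errno-style code, mirroring the convention that the redirect helpers report ``SK_PASS`` or ``SK_DROP``.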
bpf_sk_redirect_map()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
Redirect the packet to the socket referenced by ``map`` (of type
``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
to select the ingress path; otherwise the egress path is selected. This is the
only flag supported for now.
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Socket entries of type ``struct sock *`` can be retrieved using the
``bpf_map_lookup_elem()`` helper.
bpf_sock_map_update()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
Add an entry to, or update a ``map`` referencing sockets. The ``skops`` is used
as a new value for the entry associated to ``key``. The ``flags`` argument can
be one of the following:
- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.
If the ``map`` has BPF programs (parser and verdict), those will be inherited
by the socket being added. If the socket is already attached to BPF programs,
this results in an error.
Returns 0 on success, or a negative error in case of failure.
bpf_sock_hash_update()
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
Add an entry to, or update a sockhash ``map`` referencing sockets. The ``skops``
is used as a new value for the entry associated to ``key``.
The ``flags`` argument can be one of the following:
- ``BPF_ANY``: Create a new element or update an existing element.
- ``BPF_NOEXIST``: Create a new element only if it did not exist.
- ``BPF_EXIST``: Update an existing element.
If the ``map`` has BPF programs (parser and verdict), those will be inherited
by the socket being added. If the socket is already attached to BPF programs,
this results in an error.
Returns 0 on success, or a negative error in case of failure.
bpf_msg_redirect_hash()
^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
This helper is used in programs implementing policies at the socket level. If
the message ``msg`` is allowed to pass (i.e., if the verdict BPF program returns
``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
``flags`` is used to select the ingress path; otherwise the egress path is
selected. This is the only flag supported for now.
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
bpf_sk_redirect_hash()
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
This helper is used in programs implementing policies at the skb socket level.
If the sk_buff ``skb`` is allowed to pass (i.e., if the verdict BPF program
returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
``flags`` is used to select the ingress path; otherwise the egress path is
selected. This is the only flag supported for now.
Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
bpf_msg_apply_bytes()
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
For socket policies, apply the verdict of the BPF program to the next ``bytes``
(number of bytes) of message ``msg``. For example, this helper can be used in
the following cases:
- A single ``sendmsg()`` or ``sendfile()`` system call contains multiple
logical messages that the BPF program is supposed to read and for which it
should apply a verdict.
- A BPF program only cares to read the first ``bytes`` of a ``msg``. If the
message has a large payload, then setting up and calling the BPF program
repeatedly for all bytes, even though the verdict is already known, would
create unnecessary overhead.
Returns 0.
bpf_msg_cork_bytes()
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
For socket policies, prevent the execution of the verdict BPF program for
message ``msg`` until ``bytes`` bytes have been accumulated.
This can be used when one needs a specific number of bytes before a verdict can
be assigned, even if the data spans multiple ``sendmsg()`` or ``sendfile()``
calls.
Returns 0.
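The corking behaviour can be pictured with a small user-space model. This is a simplified sketch under the assumption that corking only defers the verdict until enough bytes are queued; ``struct toy_cork`` and ``toy_cork_feed()`` are hypothetical names and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-socket corking state. */
struct toy_cork {
	size_t wanted;	/* bytes requested by the program, 0 = not corked */
	size_t queued;	/* bytes accumulated so far */
};

/*
 * Model one send call of len bytes. Returns true when enough data has
 * been accumulated for the verdict program to run, false while the
 * data is still being held back.
 */
static bool toy_cork_feed(struct toy_cork *c, size_t len)
{
	c->queued += len;
	if (c->queued < c->wanted)
		return false;	/* keep accumulating across calls */
	c->wanted = 0;		/* threshold reached, cork is released */
	return true;
}
```

With a threshold of 100 bytes, three sends of 40 bytes each would hold back the first two and release the verdict on the third, at which point 120 bytes are queued.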
bpf_msg_pull_data()
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
For socket policies, pull in non-linear data from user space for ``msg`` and set
pointers ``msg->data`` and ``msg->data_end`` to ``start`` and ``end`` byte
offsets into ``msg``, respectively.
If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only
parse data that the (``data``, ``data_end``) pointers have already consumed.
For ``sendmsg()`` hooks this is likely the first scatterlist element. But for
calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be
the range (**0**, **0**) because the data is shared with user space and by
default the objective is to avoid allowing user space to modify data while (or
after) BPF verdict is being decided. This helper can be used to pull in data
and to set the start and end pointers to given values. Data will be copied if
necessary (i.e., if data was not linear and if start and end pointers do not
point to the same chunk).
A call to this helper may change the underlying packet buffer.
Therefore, at load time, all checks on pointers previously done by the verifier
are invalidated and must be performed again, if the helper is used in
combination with direct packet access.
All values for ``flags`` are reserved for future usage, and must be left at
zero.
Returns 0 on success, or a negative error in case of failure.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Look up a socket entry in the sockmap or sockhash map.
Returns the socket entry associated to ``key``, or NULL if no entry was found.
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Add or update a socket entry in a sockmap or sockhash.
The flags argument can be one of the following:
- BPF_ANY: Create a new element or update an existing element.
- BPF_NOEXIST: Create a new element only if it did not exist.
- BPF_EXIST: Update an existing element.
Returns 0 on success, or a negative error in case of failure.
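The three update flags can be illustrated with a toy array-backed map in user-space C. This is a sketch of the flag semantics only: ``struct toy_sockmap`` and ``toy_update()`` are hypothetical, the flag values match the kernel UAPI, and the specific error codes are assumptions chosen to resemble what map updates commonly return.

```c
#include <errno.h>
#include <stdint.h>

/* Flag values as defined in the kernel UAPI. */
#define BPF_ANY		0
#define BPF_NOEXIST	1
#define BPF_EXIST	2

#define TOY_MAP_SIZE 4

/* Hypothetical stand-in for a sockmap: slot i holds a socket id, 0 = empty. */
struct toy_sockmap {
	int socks[TOY_MAP_SIZE];
};

static int toy_update(struct toy_sockmap *map, uint32_t key, int sock,
		      uint64_t flags)
{
	if (flags > BPF_EXIST)
		return -EINVAL;		/* unknown flag */
	if (key >= TOY_MAP_SIZE)
		return -E2BIG;		/* index outside the array */
	if (flags == BPF_NOEXIST && map->socks[key])
		return -EEXIST;		/* create only, but slot is taken */
	if (flags == BPF_EXIST && !map->socks[key])
		return -ENOENT;		/* update only, but slot is empty */

	map->socks[key] = sock;
	return 0;
}
```

``BPF_ANY`` never fails on the state of the slot, so it is the natural choice when the caller does not care whether an entry was already present.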
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
long bpf_map_delete_elem(struct bpf_map *map, const void *key)
Delete a socket entry from a sockmap or a sockhash.
Returns 0 on success, or a negative error in case of failure.
User space
----------
bpf_map_update_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
Sockmap entries can be added or updated using the ``bpf_map_update_elem()``
function. The ``key`` parameter is the index value of the sockmap array, and the
``value`` parameter is the FD value of that socket.
Under the hood, the sockmap update function uses the socket FD value to
retrieve the associated socket and its attached psock.
The flags argument can be one of the following:
- BPF_ANY: Create a new element or update an existing element.
- BPF_NOEXIST: Create a new element only if it did not exist.
- BPF_EXIST: Update an existing element.
bpf_map_lookup_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_lookup_elem(int fd, const void *key, void *value)
Sockmap entries can be retrieved using the ``bpf_map_lookup_elem()`` function.
.. note::
The entry returned is a socket cookie rather than a socket itself.
bpf_map_delete_elem()
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: c
int bpf_map_delete_elem(int fd, const void *key)
Sockmap entries can be deleted using the ``bpf_map_delete_elem()``
function.
Returns 0 on success, or a negative error in case of failure.
Examples
========
Kernel BPF
----------
Several examples of the use of sockmap APIs can be found in:
- `tools/testing/selftests/bpf/progs/test_sockmap_kern.h`_
- `tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`_
- `tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`_
- `tools/testing/selftests/bpf/progs/test_sockmap_listen.c`_
- `tools/testing/selftests/bpf/progs/test_sockmap_update.c`_
The following code snippet shows how to declare a sockmap.
.. code-block:: c
struct {
__uint(type, BPF_MAP_TYPE_SOCKMAP);
__uint(max_entries, 1);
__type(key, __u32);
__type(value, __u64);
} sock_map_rx SEC(".maps");
The following code snippet shows a sample parser program.
.. code-block:: c
SEC("sk_skb/stream_parser")
int bpf_prog_parser(struct __sk_buff *skb)
{
return skb->len;
}
The following code snippet shows a simple verdict program that interacts with a
sockmap to redirect traffic to another socket based on the local port.
.. code-block:: c

	SEC("sk_skb/stream_verdict")
	int bpf_prog_verdict(struct __sk_buff *skb)
	{
		__u32 lport = skb->local_port;
		__u32 idx = 0;

		if (lport == 10000)
			return bpf_sk_redirect_map(skb, &sock_map_rx, idx, 0);

		return SK_PASS;
	}

The following code snippet shows how to declare a sockhash map.
.. code-block:: c
struct socket_key {
__u32 src_ip;
__u32 dst_ip;
__u32 src_port;
__u32 dst_port;
};
struct {
__uint(type, BPF_MAP_TYPE_SOCKHASH);
__uint(max_entries, 1);
__type(key, struct socket_key);
__type(value, __u64);
} sock_hash_rx SEC(".maps");
The following code snippet shows a simple verdict program that interacts with a
sockhash to redirect traffic to another socket based on a hash of some of the
skb parameters.
.. code-block:: c

	static inline
	void extract_socket_key(struct __sk_buff *skb, struct socket_key *key)
	{
		key->src_ip = skb->remote_ip4;
		key->dst_ip = skb->local_ip4;
		key->src_port = skb->remote_port >> 16;
		key->dst_port = (bpf_htonl(skb->local_port)) >> 16;
	}

	SEC("sk_skb/stream_verdict")
	int bpf_prog_verdict(struct __sk_buff *skb)
	{
		struct socket_key key;

		extract_socket_key(skb, &key);

		return bpf_sk_redirect_hash(skb, &sock_hash_rx, &key, 0);
	}

User space
----------
Several examples of the use of sockmap APIs can be found in:
- `tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`_
- `tools/testing/selftests/bpf/test_sockmap.c`_
- `tools/testing/selftests/bpf/test_maps.c`_
The following code sample shows how to create a sockmap, attach a parser and
verdict program, as well as add a socket entry.
.. code-block:: c

	int create_sample_sockmap(int sock, int parse_prog_fd, int verdict_prog_fd)
	{
		int index = 0;
		int map, err;

		/* Create a one-entry sockmap with int-sized keys and values. */
		map = bpf_map_create(BPF_MAP_TYPE_SOCKMAP, NULL, sizeof(int), sizeof(int), 1, NULL);
		if (map < 0) {
			fprintf(stderr, "Failed to create sockmap: %s\n", strerror(errno));
			return -1;
		}

		/* Attach the parser program to the map. */
		err = bpf_prog_attach(parse_prog_fd, map, BPF_SK_SKB_STREAM_PARSER, 0);
		if (err) {
			fprintf(stderr, "Failed to attach_parser_prog_to_map: %s\n", strerror(errno));
			goto out;
		}

		/* Attach the verdict program to the map. */
		err = bpf_prog_attach(verdict_prog_fd, map, BPF_SK_SKB_STREAM_VERDICT, 0);
		if (err) {
			fprintf(stderr, "Failed to attach_verdict_prog_to_map: %s\n", strerror(errno));
			goto out;
		}

		/* Insert the socket FD at index 0; fail if the slot is taken. */
		err = bpf_map_update_elem(map, &index, &sock, BPF_NOEXIST);
		if (err) {
			fprintf(stderr, "Failed to update sockmap: %s\n", strerror(errno));
			goto out;
		}

	out:
		close(map);
		return err;
	}

References
===========
- https://github.com/jrfastab/linux-kernel-xdp/commit/c89fd73cb9d2d7f3c716c3e00836f07b1aeb261f
- https://lwn.net/Articles/731133/
- http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf
- https://lwn.net/Articles/748628/
- https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com/
.. _`tools/testing/selftests/bpf/progs/test_sockmap_kern.h`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
.. _`tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
.. _`tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c
.. _`tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
.. _`tools/testing/selftests/bpf/test_sockmap.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_sockmap.c
.. _`tools/testing/selftests/bpf/test_maps.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_maps.c
.. _`tools/testing/selftests/bpf/progs/test_sockmap_listen.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
.. _`tools/testing/selftests/bpf/progs/test_sockmap_update.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_update.c


@ -178,7 +178,7 @@ The following code snippet shows how to update an XSKMAP with an XSK entry.
For an example on how create AF_XDP sockets, please see the AF_XDP-example and
AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
For a detailed explaination of the AF_XDP interface please see:
For a detailed explanation of the AF_XDP interface please see:
- `libxdp-readme`_.
- `AF_XDP`_ kernel documentation.


@ -7,3 +7,4 @@ Other
ringbuf
llvm_reloc
graph_ds_impl


@ -124,7 +124,7 @@ buffer. Currently 4 are supported:
- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer;
- ``BPF_RB_RING_SIZE`` returns the size of ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical possition
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position
of consumer/producer, respectively.
Returned values are momentarily snapshots of ring buffer state and could be
@ -146,7 +146,7 @@ Design and Implementation
This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if BPF program was interruped by another BPF program sharing the
means that if BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during


@ -192,7 +192,7 @@ checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses. For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.
@ -316,6 +316,301 @@ Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().
Some technical details about the state pruning implementation can be found below.
Register liveness tracking
--------------------------
In order to make state pruning effective, liveness state is tracked for each
register and stack slot. The basic idea is to track which registers and stack
slots are actually used during subsequent execution of the program, until
program exit is reached. Registers and stack slots that were never used could be
removed from the cached state thus making more states equivalent to a cached
state. This could be illustrated by the following program::
0: call bpf_get_prandom_u32()
1: r1 = 0
2: if r0 == 0 goto +1
3: r0 = 1
--- checkpoint ---
4: r0 = r1
5: exit
Suppose that a state cache entry is created at instruction #4 (such entries are
also called "checkpoints" in the text below). The verifier could reach the
instruction with one of two possible register states:
* r0 = 1, r1 = 0
* r0 = 0, r1 = 0
However, only the value of register ``r1`` is important to successfully finish
verification. The goal of the liveness tracking algorithm is to spot this fact
and figure out that both states are actually equivalent.
Data structures
~~~~~~~~~~~~~~~
Liveness is tracked using the following data structures::

    enum bpf_reg_liveness {
        REG_LIVE_NONE = 0,
        REG_LIVE_READ32 = 0x1,
        REG_LIVE_READ64 = 0x2,
        REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
        REG_LIVE_WRITTEN = 0x4,
        REG_LIVE_DONE = 0x8,
    };

    struct bpf_reg_state {
        ...
        struct bpf_reg_state *parent;
        ...
        enum bpf_reg_liveness live;
        ...
    };

    struct bpf_stack_state {
        struct bpf_reg_state spilled_ptr;
        ...
    };

    struct bpf_func_state {
        struct bpf_reg_state regs[MAX_BPF_REG];
        ...
        struct bpf_stack_state *stack;
    };

    struct bpf_verifier_state {
        struct bpf_func_state *frame[MAX_CALL_FRAMES];
        struct bpf_verifier_state *parent;
        ...
    };

* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
verifier state creation;
* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
defined by some instruction verified between this verifier state's parent and
verifier state itself;
* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
is read by some child state of this verifier state;
* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
processing same verifier state multiple times and for some sanity checks;
* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
values using bitwise or.
Register parentage chains
~~~~~~~~~~~~~~~~~~~~~~~~~
In order to propagate information between parent and child states, a *register
parentage chain* is established. Each register or stack slot is linked to a
corresponding register or stack slot in its parent state via a ``->parent``
pointer. This link is established upon state creation in ``is_state_visited()``
and might be modified by ``set_callee_state()`` called from
``__check_func_call()``.
The rules for correspondence between registers / stack slots are as follows:
* For the current stack frame, registers and stack slots of the new state are
linked to the registers and stack slots of the parent state with the same
indices.
* For the outer stack frames, only callee saved registers (r6-r9) and stack
slots are linked to the registers and stack slots of the parent state with the
same indices.
* When a function call is processed, a new ``struct bpf_func_state`` instance is
allocated; it encapsulates a new set of registers and stack slots. For this
new frame, parent links for r6-r9 and stack slots are set to nil, while parent
links for r1-r5 are set to match the caller's r1-r5 parent links.
This could be illustrated by the following diagram (arrows stand for
``->parent`` pointers)::
... ; Frame #0, some instructions
--- checkpoint #0 ---
1 : r6 = 42 ; Frame #0
--- checkpoint #1 ---
2 : call foo() ; Frame #0
... ; Frame #1, instructions from foo()
--- checkpoint #2 ---
... ; Frame #1, instructions from foo()
--- checkpoint #3 ---
exit ; Frame #1, return from foo()
3 : r1 = r6 ; Frame #0 <- current state
+-------------------------------+-------------------------------+
| Frame #0 | Frame #1 |
Checkpoint +-------------------------------+-------------------------------+
#0 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+
^ ^ ^ ^
| | | |
Checkpoint +-------------------------------+
#1 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+
^ ^ ^
|_______|_______|_______________
| | |
nil nil | | | nil nil
| | | | | | |
Checkpoint +-------------------------------+-------------------------------+
#2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+-------------------------------+
^ ^ ^ ^ ^
nil nil | | | | |
| | | | | | |
Checkpoint +-------------------------------+-------------------------------+
#3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+-------------------------------+
^ ^
nil nil | |
| | | |
Current +-------------------------------+
state | r0 | r1-r5 | r6-r9 | fp-8 ... |
+-------------------------------+
\
r6 read mark is propagated via these links
all the way up to checkpoint #1.
The checkpoint #1 contains a write mark for r6
because of instruction (1), thus read propagation
does not reach checkpoint #0 (see section below).
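The rule for a newly entered frame can be sketched in standalone C. This is a simplified model that tracks only the ``->parent`` pointers of registers (stack slots are omitted); ``struct toy_reg``, ``struct toy_frame`` and ``toy_enter_callee()`` are hypothetical names, not verifier code.

```c
#include <stddef.h>

#define MAX_BPF_REG 11	/* r0-r10 */

/* Hypothetical register state reduced to its parentage link. */
struct toy_reg {
	struct toy_reg *parent;
};

struct toy_frame {
	struct toy_reg regs[MAX_BPF_REG];
};

/* Set up parent links for the frame created when a call is processed. */
static void toy_enter_callee(struct toy_frame *callee,
			     const struct toy_frame *caller)
{
	/* Start with no parent links at all (nil in the diagram). */
	for (int i = 0; i < MAX_BPF_REG; i++)
		callee->regs[i].parent = NULL;

	/* r1-r5 carry the arguments: reuse the caller's parent links. */
	for (int i = 1; i <= 5; i++)
		callee->regs[i].parent = caller->regs[i].parent;
}
```

Because r1-r5 in the callee share the caller's parent links, a read of an argument inside the callee propagates to the same checkpoint registers that a read in the caller would have reached.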
Liveness marks tracking
~~~~~~~~~~~~~~~~~~~~~~~
For each processed instruction, the verifier tracks read and written registers
and stack slots. The main idea of the algorithm is that read marks propagate
back along the state parentage chain until they hit a write mark, which 'screens
off' earlier states from the read. The information about reads is propagated by
function ``mark_reg_read()`` which could be summarized as follows::

    mark_reg_read(struct bpf_reg_state *state, ...):
        parent = state->parent
        while parent:
            if state->live & REG_LIVE_WRITTEN:
                break
            if parent->live & REG_LIVE_READ64:
                break
            parent->live |= REG_LIVE_READ64
            state = parent
            parent = state->parent

Notes:
* The read marks are applied to the **parent** state while write marks are
applied to the **current** state. The write mark on a register or stack slot
means that it is updated by some instruction in the straight-line code leading
from the parent state to the current state.
* Details about REG_LIVE_READ32 are omitted.
* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
might override the first parent link. Please refer to the comments in the
``propagate_liveness()`` and ``mark_reg_read()`` source code for further
details.
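The propagation loop above can be exercised with a small model. The following is a minimal sketch, not the kernel code: the ``RegState`` class, the flag values, and the five-link chain (checkpoints #0 to #3 plus the current state, tracking ``r6`` only) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative flag values; only their distinctness matters for this sketch.
REG_LIVE_READ64 = 0x2
REG_LIVE_WRITTEN = 0x4


@dataclass
class RegState:
    """Hypothetical stand-in for one register slot in one verifier state."""
    name: str
    live: int = 0
    parent: Optional["RegState"] = None


def mark_reg_read(state: RegState) -> None:
    """Propagate a read mark up the parentage chain until screened off."""
    parent = state.parent
    while parent is not None:
        if state.live & REG_LIVE_WRITTEN:
            break  # the write in this state screens off earlier states
        if parent.live & REG_LIVE_READ64:
            break  # the parent and its ancestors are already marked
        parent.live |= REG_LIVE_READ64
        state = parent
        parent = state.parent


# r6 as seen by checkpoints #0..#3 and the current state.
cp0 = RegState("checkpoint0.r6")
cp1 = RegState("checkpoint1.r6", live=REG_LIVE_WRITTEN, parent=cp0)
cp2 = RegState("checkpoint2.r6", parent=cp1)
cp3 = RegState("checkpoint3.r6", parent=cp2)
cur = RegState("current.r6", parent=cp3)

mark_reg_read(cur)
for reg in (cp3, cp2, cp1, cp0):
    print(reg.name, "read" if reg.live & REG_LIVE_READ64 else "not read")
```

In this model the read mark lands on checkpoints #3, #2 and #1, and the write mark on checkpoint #1 stops it from reaching checkpoint #0.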
Because stack writes could have different sizes, ``REG_LIVE_WRITTEN`` marks are
applied conservatively: stack slots are marked as written only if the write size
corresponds to the size of the register (see e.g. function ``save_register_state()``).
Consider the following example::
    0: *(u64 *)(r10 - 8) = 0   ; define 8 bytes of fp-8
    --- checkpoint #0 ---
    1: *(u32 *)(r10 - 8) = 1   ; redefine lower 4 bytes
    2: r1 = *(u32 *)(r10 - 8)  ; read lower 4 bytes defined at (1)
    3: r2 = *(u32 *)(r10 - 4)  ; read upper 4 bytes defined at (0)
As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should
it be otherwise, the algorithm above wouldn't be able to propagate the read mark
from (3) to checkpoint #0.
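The conservative rule can be sketched as follows. This is a simplified model under stated assumptions: the helper names ``live_after_stack_write()`` and ``read_reaches_parent()`` are hypothetical, the flag value is illustrative, and a register is taken to be 8 bytes wide.

```python
REG_LIVE_WRITTEN = 0x4  # illustrative value
BPF_REG_SIZE = 8        # register width in bytes assumed by this sketch


def live_after_stack_write(live: int, size: int) -> int:
    """Return the slot's liveness flags after a stack write of `size` bytes."""
    if size == BPF_REG_SIZE:
        return live | REG_LIVE_WRITTEN  # the whole slot is redefined
    return live  # a partial write leaves part of the old value visible


def read_reaches_parent(live: int) -> bool:
    """A later read propagates to the parent unless a write mark screens it."""
    return not (live & REG_LIVE_WRITTEN)


# Instruction (1) writes only 4 bytes of fp-8, so no write mark is set and
# the read at (3) can still be propagated to checkpoint #0.
partial = live_after_stack_write(0, 4)
# A full 8-byte write would screen off checkpoint #0 instead.
full = live_after_stack_write(0, 8)
print(read_reaches_parent(partial), read_reaches_parent(full))
```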
Once the ``BPF_EXIT`` instruction is reached, ``update_branch_counts()`` is
called to update the ``->branches`` counter for each verifier state in a chain
of parent verifier states. When the ``->branches`` counter reaches zero, the
verifier state becomes a valid entry in a set of cached verifier states.
Each entry of the verifier states cache is post-processed by a function
``clean_live_states()``. This function marks all registers and stack slots
without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
Registers/stack slots marked in this way are ignored in function ``stacksafe()``
called from ``states_equal()`` when a state cache entry is considered for
equivalence with a current state.
Now it is possible to explain how the example from the beginning of the section
works::
0: call bpf_get_prandom_u32()
1: r1 = 0
2: if r0 == 0 goto +1
3: r0 = 1
--- checkpoint[0] ---
4: r0 = r1
5: exit
* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }``
is pushed to states processing queue (pc stands for program counter).
* At instruction #4:

  * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
  * ``checkpoint[0].r0`` is marked as written;
  * ``checkpoint[0].r1`` is marked as read;

* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
  by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
  read mark and all other registers and stack slots are marked as ``NOT_INIT``
  or ``STACK_INVALID``.
* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states
are considered equivalent.
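The walk-through above can be mimicked with a toy model. This is only a sketch: states are plain dictionaries of exact register values, ``clean_live_state()`` and ``states_equal()`` are hypothetical simplifications of the functions named in the text, and the real verifier compares value ranges rather than single constants.

```python
NOT_INIT = "NOT_INIT"  # marker for registers that were never read


def clean_live_state(regs: dict, read_marks: set) -> dict:
    """Replace every register without a read mark by NOT_INIT."""
    return {reg: (val if reg in read_marks else NOT_INIT)
            for reg, val in regs.items()}


def states_equal(cached: dict, current: dict) -> bool:
    """Registers marked NOT_INIT in the cached state are ignored."""
    return all(val is NOT_INIT or current.get(reg) == val
               for reg, val in cached.items())


# checkpoint[0] as created at instruction #4; only r1 carries a read mark.
checkpoint0 = {"r0": 1, "r1": 0}
cleaned = clean_live_state(checkpoint0, read_marks={"r1"})

# State pushed at instruction #2 and popped from the queue later.
pending = {"r0": 0, "r1": 0}

print(states_equal(checkpoint0, pending))  # compared before cleaning
print(states_equal(cleaned, pending))      # compared after cleaning
```

Before cleaning, the differing ``r0`` values keep the states apart; after cleaning, only ``r1`` is compared and the pending state can be pruned.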
.. _read_marks_for_cache_hits:
Read marks propagation for cache hits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Another point is the handling of read marks when a previously verified state is
found in the states cache. Upon a cache hit the verifier must behave in the same
way as if the current state was verified to the program exit. This means that all
read marks present on registers and stack slots of the cached state must be
propagated over the parentage chain of the current state. The function
``propagate_liveness()`` handles this case; the example below shows why this is
important.
Consider the following state parentage chain (S is a starting state, A-E are
derived states, -> arrows show which state is derived from which)::
                   r1 read
            <-------------                A[r1] == 0
                                          C[r1] == 0
      S ---> A ---> B ---> exit           E[r1] == 1
      |
      ` ---> C ---> D
      |
      ` ---> E      ^
                    |___   suppose all these
             ^           states are at insn #Y
             |
      suppose all these
    states are at insn #X
* Chain of states ``S -> A -> B -> exit`` is verified first.
* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
propagated up to state ``A``.
* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
equivalent to state ``B``.
* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
``C`` might get mistakenly marked as equivalent to state ``E`` even though
values for register ``r1`` differ between ``C`` and ``E``.
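The scenario above can be replayed in a small model. This is a sketch under simplifying assumptions: the ``State`` class is hypothetical, write marks are omitted, read marks are stored on the state itself, and equivalence is reduced to comparing exact values of registers that carry a read mark in the cached state.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class State:
    """Hypothetical verifier state holding exact register values."""
    name: str
    regs: dict
    parent: Optional["State"] = None
    read: set = field(default_factory=set)


def mark_read(state: Optional[State], reg: str) -> None:
    """Propagate a read of `reg` up the parentage chain (no write marks)."""
    while state is not None and reg not in state.read:
        state.read.add(reg)
        state = state.parent


def propagate_liveness(cached: State, current: State) -> None:
    """On a cache hit, replay the cached state's read marks on `current`."""
    for reg in cached.read:
        mark_read(current, reg)


def equivalent(cached: State, current: State) -> bool:
    """Only registers with a read mark in the cached state are compared."""
    return all(cached.regs[r] == current.regs[r] for r in cached.read)


S = State("S", {"r1": 0})
A = State("A", {"r1": 0}, parent=S)
B = State("B", {"r1": 0}, parent=A)
mark_read(B, "r1")                   # r1 is read while verifying B -> exit

C = State("C", {"r1": 0}, parent=S)
D = State("D", {"r1": 0}, parent=C)
E = State("E", {"r1": 1}, parent=S)

mistaken = equivalent(C, E)          # C has no read marks yet
propagate_liveness(B, D)             # D hits the cache entry for B
correct = equivalent(C, E)
print(mistaken, correct)
```

Without the propagation step, ``C`` has nothing to compare and is wrongly treated as equivalent to ``E``; with it, the differing ``r1`` values are detected.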
Understanding eBPF verifier messages
====================================


@ -116,6 +116,9 @@ if major >= 3:
# include/linux/linkage.h:
"asmlinkage",
# include/linux/btf.h
"__bpf_kfunc",
]
else:


@ -127,6 +127,7 @@ Documents that don't fit elsewhere or which have yet to be categorized.
:maxdepth: 1
librs
netlink
.. only:: subproject and html


@ -0,0 +1,101 @@
.. SPDX-License-Identifier: BSD-3-Clause
.. _kernel_netlink:
===================================
Netlink notes for kernel developers
===================================
General guidance
================
Attribute enums
---------------
Older families often define "null" attributes and commands with value
of ``0`` and named ``unspec``. This is supported (``type: unused``)
but should be avoided in new families. The ``unspec`` enum values are
not used in practice, so just set the value of the first attribute to ``1``.
Message enums
-------------
Use the same command IDs for requests and replies. This makes it easier
to match them up, and we have plenty of ID space.
Use separate command IDs for notifications. This makes it easier to
sort the notifications from replies (and present them to the user
application via a different API than replies).
Answer requests
---------------
Older families do not reply to all of the commands, especially NEW / ADD
commands. The user only learns whether the operation succeeded or not via
the ACK. Try to find useful data to return. Once the command is added,
whether it replies with a full message or only an ACK is uAPI and
cannot be changed. It's better to err on the side of replying.
Specifically NEW and ADD commands should reply with information identifying
the created object such as the allocated object's ID (without having to
resort to using ``NLM_F_ECHO``).
NLM_F_ECHO
----------
Make sure to pass the request info to genl_notify() to allow ``NLM_F_ECHO``
to take effect. This is useful for programs that need precise feedback
from the kernel (for example for logging purposes).
Support dump consistency
------------------------
If iterating over objects during a dump may skip over objects or repeat
them, make sure to report the dump inconsistency with ``NLM_F_DUMP_INTR``.
This is usually implemented by maintaining a generation id for the
structure and recording it in the ``seq`` member of struct netlink_callback.
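The generation-id scheme can be illustrated with a user-space model. This is a sketch, not kernel code: the ``Table`` class, the dictionary standing in for struct netlink_callback, and its ``prev_seq`` and ``pos`` bookkeeping fields are assumptions made for this example.

```python
NLM_F_DUMP_INTR = 0x10  # flag value from the netlink uAPI header


class Table:
    """Hypothetical object table with a generation id bumped on every change."""

    def __init__(self, items):
        self.items = list(items)
        self.gen = 1  # start non-zero so that 0 can mean "no previous pass"

    def add(self, item):
        self.items.append(item)
        self.gen += 1


def dump_pass(table, cb, max_items):
    """Emit one batch of a multi-part dump; return (objects, message flags)."""
    cb["seq"] = table.gen  # record the generation seen by this pass
    flags = 0
    if cb["prev_seq"] and cb["prev_seq"] != cb["seq"]:
        flags |= NLM_F_DUMP_INTR  # the table changed between passes
    cb["prev_seq"] = cb["seq"]
    chunk = table.items[cb["pos"]:cb["pos"] + max_items]
    cb["pos"] += len(chunk)
    return chunk, flags


table = Table(["a", "b", "c", "d"])
cb = {"seq": 0, "prev_seq": 0, "pos": 0}

first, first_flags = dump_pass(table, cb, 2)
table.add("e")  # concurrent modification between two passes
second, second_flags = dump_pass(table, cb, 2)
print(first, hex(first_flags), second, hex(second_flags))
```

A user-space reader that sees the flag set on any part of the dump would typically discard the result and retry.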
Netlink specification
=====================
Documentation of the Netlink specification parts which are only relevant
to the kernel space.
Globals
-------
kernel-policy
~~~~~~~~~~~~~
Defines if the kernel validation policy is per operation (``per-op``)
or for the entire family (``global``). New families should use ``per-op``
(default) to be able to narrow down the attributes accepted by a specific
command.
checks
------
Documentation for the ``checks`` sub-sections of attribute specs.
unterminated-ok
~~~~~~~~~~~~~~~
Accept strings without null-termination (for legacy families only).
Switches from the ``NLA_NUL_STRING`` to the ``NLA_STRING`` policy type.
max-len
~~~~~~~
Defines max length for a binary or string attribute (corresponding
to the ``len`` member of struct nla_policy). For string attributes the terminating
null character is not counted towards ``max-len``.
The field may either be a literal integer value or a name of a defined
constant. String types may reduce the constant by one
(i.e. specify ``max-len: CONST - 1``) to reserve space for the terminating
character, so implementations should recognize this pattern.
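A spec consumer might resolve the three accepted forms along these lines. This is a minimal sketch: ``resolve_max_len()`` and the ``constants`` lookup table are hypothetical, and requiring whitespace around the minus sign is a choice made here so that hyphenated constant names stay unambiguous.

```python
import re

# NAME optionally followed by " - 1"; names may contain '-' or '_'.
_MAX_LEN_RE = re.compile(r"([A-Za-z_][\w-]*)(\s+-\s+1)?")


def resolve_max_len(value, constants):
    """Resolve a ``max-len`` value to an integer.

    `value` is a literal integer, the name of a defined constant, or the
    ``CONST - 1`` pattern. `constants` maps constant names to integers.
    """
    if isinstance(value, int):
        return value
    match = _MAX_LEN_RE.fullmatch(value.strip())
    if match is None or match.group(1) not in constants:
        raise ValueError(f"unsupported max-len value: {value!r}")
    base = constants[match.group(1)]
    return base - 1 if match.group(2) else base


constants = {"IFNAMSIZ": 16}  # example table supplied by the caller
print(resolve_max_len(32, constants))
print(resolve_max_len("IFNAMSIZ", constants))
print(resolve_max_len("IFNAMSIZ - 1", constants))
```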
min-len
~~~~~~~
Similar to ``max-len`` but defines minimum length.


@ -161,6 +161,6 @@ xxx_packing() that calls it using the proper QUIRK_* one-hot bits set.
The packing() function returns an int-encoded error code, which protects the
programmer against incorrect API use. The errors are not expected to occur
durring runtime, therefore it is reasonable for xxx_packing() to return void
during runtime, therefore it is reasonable for xxx_packing() to return void
and simply swallow those errors. Optionally it can dump stack or print the
error description.


@ -57,6 +57,15 @@ patternProperties:
enum:
- mscc,ocelot-miim
"^ethernet-switch@[0-9a-f]+$":
type: object
$ref: /schemas/net/mscc,vsc7514-switch.yaml
unevaluatedProperties: false
properties:
compatible:
enum:
- mscc,vsc7512-switch
required:
- compatible
- reg


@ -0,0 +1,80 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/amlogic,g12a-mdio-mux.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MDIO bus multiplexer/glue of Amlogic G12a SoC family
description:
This is a special case of a MDIO bus multiplexer. It allows to choose between
the internal mdio bus leading to the embedded 10/100 PHY or the external
MDIO bus.
maintainers:
- Neil Armstrong <neil.armstrong@linaro.org>
allOf:
- $ref: mdio-mux.yaml#
properties:
compatible:
const: amlogic,g12a-mdio-mux
reg:
maxItems: 1
clocks:
items:
- description: peripheral clock
- description: platform crystal
- description: SoC 50MHz MPLL
clock-names:
items:
- const: pclk
- const: clkin0
- const: clkin1
required:
- compatible
- reg
- clocks
- clock-names
unevaluatedProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/irq.h>
#include <dt-bindings/interrupt-controller/arm-gic.h>
mdio-multiplexer@4c000 {
compatible = "amlogic,g12a-mdio-mux";
reg = <0x4c000 0xa4>;
clocks = <&clkc_eth_phy>, <&xtal>, <&clkc_mpll>;
clock-names = "pclk", "clkin0", "clkin1";
mdio-parent-bus = <&mdio0>;
#address-cells = <1>;
#size-cells = <0>;
mdio@0 {
reg = <0>;
#address-cells = <1>;
#size-cells = <0>;
};
mdio@1 {
reg = <1>;
#address-cells = <1>;
#size-cells = <0>;
ethernet-phy@8 {
compatible = "ethernet-phy-id0180.3301",
"ethernet-phy-ieee802.3-c22";
interrupts = <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>;
reg = <8>;
max-speed = <100>;
};
};
};
...


@ -0,0 +1,64 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/amlogic,gxl-mdio-mux.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Amlogic GXL MDIO bus multiplexer
maintainers:
- Jerome Brunet <jbrunet@baylibre.com>
description:
This is a special case of a MDIO bus multiplexer. It allows to choose between
the internal mdio bus leading to the embedded 10/100 PHY or the external
MDIO bus on the Amlogic GXL SoC family.
allOf:
- $ref: mdio-mux.yaml#
properties:
compatible:
const: amlogic,gxl-mdio-mux
reg:
maxItems: 1
clocks:
maxItems: 1
clock-names:
items:
- const: ref
required:
- compatible
- reg
- clocks
- clock-names
unevaluatedProperties: false
examples:
- |
eth_phy_mux: mdio@558 {
compatible = "amlogic,gxl-mdio-mux";
reg = <0x558 0xc>;
#address-cells = <1>;
#size-cells = <0>;
clocks = <&refclk>;
clock-names = "ref";
mdio-parent-bus = <&mdio0>;
external_mdio: mdio@0 {
reg = <0x0>;
#address-cells = <1>;
#size-cells = <0>;
};
internal_mdio: mdio@1 {
reg = <0x1>;
#address-cells = <1>;
#size-cells = <0>;
};
};


@ -19,6 +19,7 @@ description: |
allOf:
- $ref: ethernet-controller.yaml#
- $ref: /schemas/spi/spi-peripheral-props.yaml
properties:
compatible:
@ -39,8 +40,8 @@ properties:
it should be marked GPIO_ACTIVE_LOW.
maxItems: 1
controller-data: true
local-mac-address: true
mac-address: true
required:


@ -28,6 +28,12 @@ properties:
- renesas,r8a77995-canfd # R-Car D3
- const: renesas,rcar-gen3-canfd # R-Car Gen3 and RZ/G2
- items:
- enum:
- renesas,r8a779a0-canfd # R-Car V3U
- renesas,r8a779g0-canfd # R-Car V4H
- const: renesas,rcar-gen4-canfd # R-Car Gen4
- items:
- enum:
- renesas,r9a07g043-canfd # RZ/G2UL and RZ/Five
@ -35,8 +41,6 @@ properties:
- renesas,r9a07g054-canfd # RZ/V2L
- const: renesas,rzg2l-canfd # RZ/G2L family
- const: renesas,r8a779a0-canfd # R-Car V3U
reg:
maxItems: 1
@ -60,7 +64,7 @@ properties:
$ref: /schemas/types.yaml#/definitions/flag
description:
The controller can operate in either CAN FD only mode (default) or
Classical CAN only mode. The mode is global to both the channels.
Classical CAN only mode. The mode is global to all channels.
Specify this property to put the controller in Classical CAN only mode.
assigned-clocks:
@ -80,6 +84,10 @@ patternProperties:
The controller supports multiple channels and each is represented as a
child node. Each channel can be enabled/disabled individually.
properties:
phys:
maxItems: 1
additionalProperties: false
required:
@ -159,7 +167,7 @@ allOf:
properties:
compatible:
contains:
const: renesas,r8a779a0-canfd
const: renesas,rcar-gen4-canfd
then:
patternProperties:
"^channel[2-7]$": false


@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Arrow SpeedChips XRS7000 Series Switch
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
maintainers:
- George McCollister <george.mccollister@gmail.com>


@ -66,7 +66,7 @@ required:
- reg
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
- if:
properties:
compatible:


@ -85,6 +85,11 @@ properties:
ports:
type: object
patternProperties:
'^port@[0-9a-f]$':
$ref: dsa-port.yaml#
unevaluatedProperties: false
properties:
brcm,use-bcm-hdr:
description: if present, indicates that the switch port has Broadcom


@ -4,18 +4,19 @@
$id: http://devicetree.org/schemas/net/dsa/dsa-port.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Ethernet Switch port
title: Generic DSA Switch Port
maintainers:
- Andrew Lunn <andrew@lunn.ch>
- Florian Fainelli <f.fainelli@gmail.com>
- Vivien Didelot <vivien.didelot@gmail.com>
- Vladimir Oltean <olteanv@gmail.com>
description:
Ethernet switch port Description
A DSA switch port is a component of a switch that manages one MAC, and can
pass Ethernet frames. It can act as a standard Ethernet switch port, or have
DSA-specific functionality.
allOf:
- $ref: /schemas/net/ethernet-controller.yaml#
$ref: /schemas/net/ethernet-switch-port.yaml#
properties:
reg:
@ -58,25 +59,6 @@ properties:
- rtl8_4t
- seville
phy-handle: true
phy-mode: true
fixed-link: true
mac-address: true
sfp: true
managed: true
rx-internal-delay-ps: true
tx-internal-delay-ps: true
required:
- reg
# CPU and DSA ports must have phylink-compatible link descriptions
if:
oneOf:


@ -9,7 +9,7 @@ title: Ethernet Switch
maintainers:
- Andrew Lunn <andrew@lunn.ch>
- Florian Fainelli <f.fainelli@gmail.com>
- Vivien Didelot <vivien.didelot@gmail.com>
- Vladimir Oltean <olteanv@gmail.com>
description:
This binding represents Ethernet Switches which have a dedicated CPU
@ -18,10 +18,9 @@ description:
select: false
properties:
$nodename:
pattern: "^(ethernet-)?switch(@.*)?$"
$ref: /schemas/net/ethernet-switch.yaml#
properties:
dsa,member:
minItems: 2
maxItems: 2
@ -32,9 +31,18 @@ properties:
(single device hanging off a CPU port) must not specify this property
$ref: /schemas/types.yaml#/definitions/uint32-array
additionalProperties: true
$defs:
ethernet-ports:
description: A DSA switch without any extra port properties
$ref: '#/'
patternProperties:
"^(ethernet-)?ports$":
type: object
additionalProperties: false
properties:
'#address-cells':
const: 1
@ -43,19 +51,8 @@ patternProperties:
patternProperties:
"^(ethernet-)?port@[0-9]+$":
type: object
description: Ethernet switch ports
$ref: dsa-port.yaml#
unevaluatedProperties: false
oneOf:
- required:
- ports
- required:
- ethernet-ports
additionalProperties: true
...


@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Hirschmann Hellcreek TSN Switch
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
maintainers:
- Andrew Lunn <andrew@lunn.ch>


@ -24,56 +24,46 @@ description: |
There is only the standalone version of MT7531.
Port 5 on MT7530 has got various ways of configuration.
For standalone MT7530:
Port 5 on MT7530 has got various ways of configuration:
- Port 5 can be used as a CPU port.
- PHY 0 or 4 of the switch can be muxed to connect to the gmac of the SoC
which port 5 is wired to. Usually used for connecting the wan port
directly to the CPU to achieve 2 Gbps routing in total.
- PHY 0 or 4 of the switch can be muxed to gmac5 of the switch. Therefore,
the gmac of the SoC which is wired to port 5 can connect to the PHY.
This is usually used for connecting the wan port directly to the CPU to
achieve 2 Gbps routing in total.
The driver looks up the reg on the ethernet-phy node which the phy-handle
property refers to on the gmac node to mux the specified phy.
The driver looks up the reg on the ethernet-phy node, which the phy-handle
property on the gmac node refers to, to mux the specified phy.
The driver requires the gmac of the SoC to have "mediatek,eth-mac" as the
compatible string and the reg must be 1. So, for now, only gmac1 of an
compatible string and the reg must be 1. So, for now, only gmac1 of a
MediaTek SoC can benefit this. Banana Pi BPI-R2 suits this.
Check out example 5 for a similar configuration.
- Port 5 can be wired to an external phy. Port 5 becomes a DSA slave.
Check out example 7 for a similar configuration.
For multi-chip module MT7530:
- Port 5 can be used as a CPU port.
- PHY 0 or 4 of the switch can be muxed to connect to gmac1 of the SoC.
Usually used for connecting the wan port directly to the CPU to achieve 2
Gbps routing in total.
The driver looks up the reg on the ethernet-phy node which the phy-handle
property refers to on the gmac node to mux the specified phy.
For the MT7621 SoCs, rgmii2 group must be claimed with rgmii2 function.
Check out example 5.
- In case of an external phy wired to gmac1 of the SoC, port 5 must not be
enabled.
- For the multi-chip module MT7530, in case of an external phy wired to
gmac1 of the SoC, port 5 must not be enabled.
In case of muxing PHY 0 or 4, the external phy must not be enabled.
For the MT7621 SoCs, rgmii2 group must be claimed with rgmii2 function.
Check out example 6.
- Port 5 can be muxed to an external phy. Port 5 becomes a DSA slave.
The external phy must be wired TX to TX to gmac1 of the SoC for this to
work. Ubiquiti EdgeRouter X SFP is wired this way.
- Port 5 can be wired to an external phy. Port 5 becomes a DSA slave.
Muxing PHY 0 or 4 won't work when the external phy is connected TX to TX.
For the multi-chip module MT7530, the external phy must be wired TX to TX
to gmac1 of the SoC for this to work. Ubiquiti EdgeRouter X SFP is wired
this way.
For the multi-chip module MT7530, muxing PHY 0 or 4 won't work when the
external phy is connected TX to TX.
For the MT7621 SoCs, rgmii2 group must be claimed with gpio function.
Check out example 7.
properties:
@ -157,9 +147,6 @@ patternProperties:
patternProperties:
"^(ethernet-)?port@[0-9]+$":
type: object
description: Ethernet switch ports
unevaluatedProperties: false
properties:
reg:
@ -168,7 +155,6 @@ patternProperties:
for user ports.
allOf:
- $ref: dsa-port.yaml#
- if:
required: [ ethernet ]
then:
@ -238,7 +224,7 @@ $defs:
- sgmii
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
- if:
required:
- mediatek,mcm
@ -605,7 +591,7 @@ examples:
label = "lan4";
};
/* Commented out, phy4 is muxed to gmac1.
/* Commented out, phy4 is connected to gmac1.
port@4 {
reg = <4>;
label = "wan";


@ -11,7 +11,7 @@ maintainers:
- Woojung Huh <Woojung.Huh@microchip.com>
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
- $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:


@ -10,7 +10,7 @@ maintainers:
- UNGLinuxDriver@microchip.com
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
properties:
compatible:


@ -78,7 +78,7 @@ required:
- reg
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
- if:
properties:
compatible:


@ -13,7 +13,7 @@ description:
depends on the SPI bus master driver.
allOf:
- $ref: "dsa.yaml#"
- $ref: dsa.yaml#/$defs/ethernet-ports
- $ref: /schemas/spi/spi-peripheral-props.yaml#
maintainers:


@ -66,15 +66,11 @@ properties:
With the legacy mapping the reg corresponding to the internal
mdio is the switch reg with an offset of -1.
$ref: "dsa.yaml#"
patternProperties:
"^(ethernet-)?ports$":
type: object
properties:
'#address-cells':
const: 1
'#size-cells':
const: 0
patternProperties:
"^(ethernet-)?port@[0-6]$":
type: object
@ -116,7 +112,7 @@ required:
- compatible
- reg
additionalProperties: true
unevaluatedProperties: false
examples:
- |
@ -148,8 +144,6 @@ examples:
switch@10 {
compatible = "qca,qca8337";
#address-cells = <1>;
#size-cells = <0>;
reset-gpios = <&gpio 42 GPIO_ACTIVE_LOW>;
reg = <0x10>;
@ -209,8 +203,6 @@ examples:
switch@10 {
compatible = "qca,qca8337";
#address-cells = <1>;
#size-cells = <0>;
reset-gpios = <&gpio 42 GPIO_ACTIVE_LOW>;
reg = <0x10>;


@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Realtek switches for unmanaged switches
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
maintainers:
- Linus Walleij <linus.walleij@linaro.org>


@ -14,7 +14,7 @@ description: |
handles 4 ports + 1 CPU management port.
allOf:
- $ref: dsa.yaml#
- $ref: dsa.yaml#/$defs/ethernet-ports
properties:
compatible:


@ -0,0 +1,26 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/ethernet-switch-port.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Generic Ethernet Switch Port
maintainers:
- Andrew Lunn <andrew@lunn.ch>
- Florian Fainelli <f.fainelli@gmail.com>
- Vladimir Oltean <olteanv@gmail.com>
description:
An Ethernet switch port is a component of a switch that manages one MAC, and
can pass Ethernet frames.
$ref: ethernet-controller.yaml#
properties:
reg:
description: Port number
additionalProperties: true
...


@ -0,0 +1,62 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/ethernet-switch.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Generic Ethernet Switch
maintainers:
- Andrew Lunn <andrew@lunn.ch>
- Florian Fainelli <f.fainelli@gmail.com>
- Vladimir Oltean <olteanv@gmail.com>
description:
Ethernet switches are multi-port Ethernet controllers. Each port has
its own number and is represented as its own Ethernet controller.
The minimum required functionality is to pass packets to software.
They may or may not be able to forward packets autonomously between
ports.
select: false
properties:
$nodename:
pattern: "^(ethernet-)?switch(@.*)?$"
patternProperties:
"^(ethernet-)?ports$":
type: object
unevaluatedProperties: false
properties:
'#address-cells':
const: 1
'#size-cells':
const: 0
patternProperties:
"^(ethernet-)?port@[0-9]+$":
type: object
description: Ethernet switch ports
oneOf:
- required:
- ports
- required:
- ethernet-ports
additionalProperties: true
$defs:
base:
description: An ethernet switch without any extra port properties
$ref: '#/'
patternProperties:
"^(ethernet-)?port@[0-9]+$":
description: Ethernet switch ports
$ref: ethernet-switch-port.yaml#
unevaluatedProperties: false
...


@ -51,6 +51,7 @@ properties:
- fsl,imx8mm-fec
- fsl,imx8mn-fec
- fsl,imx8mp-fec
- fsl,imx93-fec
- const: fsl,imx8mq-fec
- const: fsl,imx6sx-fec
- items:


@ -0,0 +1,47 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/maxlinear,gpy2xx.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MaxLinear GPY2xx PHY
maintainers:
- Andrew Lunn <andrew@lunn.ch>
- Michael Walle <michael@walle.cc>
allOf:
- $ref: ethernet-phy.yaml#
properties:
maxlinear,use-broken-interrupts:
description: |
Interrupts are broken on some GPY2xx PHYs in that they keep the
interrupt line asserted even after the interrupt status register is
cleared. Thus it is blocking the interrupt line which is usually bad
for shared lines. By default interrupts are disabled for this PHY and
polling mode is used. If one can live with the consequences, this
property can be used to enable interrupt handling.
Affected PHYs (as far as known) are GPY215B and GPY215C.
type: boolean
dependencies:
maxlinear,use-broken-interrupts: [ interrupts ]
unevaluatedProperties: false
examples:
- |
ethernet {
#address-cells = <1>;
#size-cells = <0>;
ethernet-phy@0 {
reg = <0>;
interrupts-extended = <&intc 0>;
maxlinear,use-broken-interrupts;
};
};
...


@ -1,48 +0,0 @@
Properties for the MDIO bus multiplexer/glue of Amlogic G12a SoC family.
This is a special case of a MDIO bus multiplexer. It allows to choose between
the internal mdio bus leading to the embedded 10/100 PHY or the external
MDIO bus.
Required properties in addition to the generic multiplexer properties:
- compatible : amlogic,g12a-mdio-mux
- reg: physical address and length of the multiplexer/glue registers
- clocks: list of clock phandle, one for each entry clock-names.
- clock-names: should contain the following:
* "pclk" : peripheral clock.
* "clkin0" : platform crystal
* "clkin1" : SoC 50MHz MPLL
Example :
mdio_mux: mdio-multiplexer@4c000 {
compatible = "amlogic,g12a-mdio-mux";
reg = <0x0 0x4c000 0x0 0xa4>;
clocks = <&clkc CLKID_ETH_PHY>,
<&xtal>,
<&clkc CLKID_MPLL_5OM>;
clock-names = "pclk", "clkin0", "clkin1";
mdio-parent-bus = <&mdio0>;
#address-cells = <1>;
#size-cells = <0>;
ext_mdio: mdio@0 {
reg = <0>;
#address-cells = <1>;
#size-cells = <0>;
};
int_mdio: mdio@1 {
reg = <1>;
#address-cells = <1>;
#size-cells = <0>;
internal_ephy: ethernet-phy@8 {
compatible = "ethernet-phy-id0180.3301",
"ethernet-phy-ieee802.3-c22";
interrupts = <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>;
reg = <8>;
max-speed = <100>;
};
};
};


@ -158,6 +158,7 @@ KSZ9031:
no link will be established.
KSZ9131:
LAN8841:
All skew control options are specified in picoseconds. The increment
step is 100ps. Unlike KSZ9031, the values represent picosecond delays.


@ -0,0 +1,117 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/motorcomm,yt8xxx.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MotorComm yt8xxx Ethernet PHY
maintainers:
- Frank Sae <frank.sae@motor-comm.com>
allOf:
- $ref: ethernet-phy.yaml#
properties:
compatible:
enum:
- ethernet-phy-id4f51.e91a
- ethernet-phy-id4f51.e91b
rx-internal-delay-ps:
description: |
RGMII RX Clock Delay used only when PHY operates in RGMII mode with
internal delay (phy-mode is 'rgmii-id' or 'rgmii-rxid') in pico-seconds.
enum: [ 0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650,
1800, 1900, 1950, 2050, 2100, 2200, 2250, 2350, 2500, 2650, 2800,
2950, 3100, 3250, 3400, 3550, 3700, 3850, 4000, 4150 ]
default: 1950
tx-internal-delay-ps:
description: |
RGMII TX Clock Delay used only when PHY operates in RGMII mode with
internal delay (phy-mode is 'rgmii-id' or 'rgmii-txid') in pico-seconds.
enum: [ 0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650, 1800,
1950, 2100, 2250 ]
default: 1950
motorcomm,clk-out-frequency-hz:
description: clock output on clock output pin.
enum: [0, 25000000, 125000000]
default: 0
motorcomm,keep-pll-enabled:
description: |
If set, keep the PLL enabled even if there is no link. Useful if you
want to use the clock output without an ethernet link.
type: boolean
motorcomm,auto-sleep-disabled:
description: |
If set, the PHY will not enter sleep mode and close the AFE after the
cable has been unplugged for a period of time.
type: boolean
motorcomm,tx-clk-adj-enabled:
description: |
This configuration is mainly to adapt to VF2 with JH7110 SoC.
Useful if you want to use tx-clk-xxxx-inverted to adj the delay of tx clk.
type: boolean
motorcomm,tx-clk-10-inverted:
description: |
Use original or inverted RGMII Transmit PHY Clock to drive the RGMII
Transmit PHY Clock delay train configuration when speed is 10Mbps.
type: boolean
motorcomm,tx-clk-100-inverted:
description: |
Use original or inverted RGMII Transmit PHY Clock to drive the RGMII
Transmit PHY Clock delay train configuration when speed is 100Mbps.
type: boolean
motorcomm,tx-clk-1000-inverted:
description: |
Use original or inverted RGMII Transmit PHY Clock to drive the RGMII
Transmit PHY Clock delay train configuration when speed is 1000Mbps.
type: boolean
unevaluatedProperties: false
examples:
- |
mdio {
#address-cells = <1>;
#size-cells = <0>;
phy-mode = "rgmii-id";
ethernet-phy@4 {
/* Only needed to make DT lint tools work. Do not copy/paste
* into real DTS files.
*/
compatible = "ethernet-phy-id4f51.e91a";
reg = <4>;
rx-internal-delay-ps = <2100>;
tx-internal-delay-ps = <150>;
motorcomm,clk-out-frequency-hz = <0>;
motorcomm,keep-pll-enabled;
motorcomm,auto-sleep-disabled;
};
};
- |
mdio {
#address-cells = <1>;
#size-cells = <0>;
phy-mode = "rgmii";
ethernet-phy@5 {
/* Only needed to make DT lint tools work. Do not copy/paste
* into real DTS files.
*/
compatible = "ethernet-phy-id4f51.e91a";
reg = <5>;
motorcomm,clk-out-frequency-hz = <125000000>;
motorcomm,keep-pll-enabled;
motorcomm,auto-sleep-disabled;
};
};


@ -18,14 +18,52 @@ description: |
packets using CPU. Additionally, PTP is supported as well as FDMA for faster
packet extraction/injection.
allOf:
- if:
properties:
$nodename:
pattern: "^switch@[0-9a-f]+$"
compatible:
const: mscc,vsc7514-switch
then:
$ref: ethernet-switch.yaml#
required:
- interrupts
- interrupt-names
properties:
reg:
minItems: 21
reg-names:
minItems: 21
ethernet-ports:
patternProperties:
"^port@[0-9a-f]+$":
$ref: ethernet-switch-port.yaml#
unevaluatedProperties: false
- if:
properties:
compatible:
const: mscc,vsc7512-switch
then:
$ref: /schemas/net/dsa/dsa.yaml#
properties:
reg:
maxItems: 20
reg-names:
maxItems: 20
ethernet-ports:
patternProperties:
"^port@[0-9a-f]+$":
$ref: /schemas/net/dsa/dsa-port.yaml#
unevaluatedProperties: false
properties:
compatible:
enum:
- mscc,vsc7512-switch
- mscc,vsc7514-switch
reg:
minItems: 20
items:
- description: system target
- description: rewriter target
@ -50,6 +88,7 @@ properties:
- description: fdma target
reg-names:
minItems: 20
items:
- const: sys
- const: rew
@ -87,59 +126,16 @@ properties:
- const: xtr
- const: fdma
ethernet-ports:
type: object
properties:
'#address-cells':
const: 1
'#size-cells':
const: 0
additionalProperties: false
patternProperties:
"^port@[0-9a-f]+$":
type: object
description: Ethernet ports handled by the switch
$ref: ethernet-controller.yaml#
unevaluatedProperties: false
properties:
reg:
description: Switch port number
phy-handle: true
phy-mode: true
fixed-link: true
mac-address: true
required:
- reg
- phy-mode
oneOf:
- required:
- phy-handle
- required:
- fixed-link
required:
- compatible
- reg
- reg-names
- interrupts
- interrupt-names
- ethernet-ports
additionalProperties: false
unevaluatedProperties: false
examples:
# VSC7514 (Switchdev)
- |
switch@1010000 {
compatible = "mscc,vsc7514-switch";
@ -187,5 +183,51 @@ examples:
};
};
};
# VSC7512 (DSA)
- |
ethernet-switch@1 {
compatible = "mscc,vsc7512-switch";
reg = <0x71010000 0x10000>,
<0x71030000 0x10000>,
<0x71080000 0x100>,
<0x710e0000 0x10000>,
<0x711e0000 0x100>,
<0x711f0000 0x100>,
<0x71200000 0x100>,
<0x71210000 0x100>,
<0x71220000 0x100>,
<0x71230000 0x100>,
<0x71240000 0x100>,
<0x71250000 0x100>,
<0x71260000 0x100>,
<0x71270000 0x100>,
<0x71280000 0x100>,
<0x71800000 0x80000>,
<0x71880000 0x10000>,
<0x71040000 0x10000>,
<0x71050000 0x10000>,
<0x71060000 0x10000>;
reg-names = "sys", "rew", "qs", "ptp", "port0", "port1",
"port2", "port3", "port4", "port5", "port6",
"port7", "port8", "port9", "port10", "qsys",
"ana", "s0", "s1", "s2";
ethernet-ports {
#address-cells = <1>;
#size-cells = <0>;
port@0 {
reg = <0>;
ethernet = <&mac_sw>;
phy-handle = <&phy0>;
phy-mode = "internal";
};
port@1 {
reg = <1>;
phy-handle = <&phy1>;
phy-mode = "internal";
};
};
};
...


@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/net/nxp,dwmac-imx.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: NXP i.MX8 DWMAC glue layer
title: NXP i.MX8/9 DWMAC glue layer
maintainers:
- Clark Wang <xiaoning.wang@nxp.com>
@ -19,6 +19,7 @@ select:
enum:
- nxp,imx8mp-dwmac-eqos
- nxp,imx8dxl-dwmac-eqos
- nxp,imx93-dwmac-eqos
required:
- compatible
@ -32,6 +33,7 @@ properties:
- enum:
- nxp,imx8mp-dwmac-eqos
- nxp,imx8dxl-dwmac-eqos
- nxp,imx93-dwmac-eqos
- const: snps,dwmac-5.10a
clocks:


@ -0,0 +1,51 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/rfkill-gpio.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: GPIO controlled rfkill switch
maintainers:
- Johannes Berg <johannes@sipsolutions.net>
- Philipp Zabel <p.zabel@pengutronix.de>
properties:
compatible:
const: rfkill-gpio
label:
description: rfkill switch name, defaults to node name
radio-type:
description: rfkill radio type
enum:
- bluetooth
- fm
- gps
- nfc
- ultrawideband
- wimax
- wlan
- wwan
shutdown-gpios:
maxItems: 1
required:
- compatible
- radio-type
- shutdown-gpios
additionalProperties: false
examples:
- |
#include <dt-bindings/gpio/gpio.h>
rfkill {
compatible = "rfkill-gpio";
label = "rfkill-pcie-wlan";
radio-type = "wlan";
shutdown-gpios = <&gpio2 25 GPIO_ACTIVE_HIGH>;
};


@ -49,11 +49,11 @@ properties:
- rockchip,rk3368-gmac
- rockchip,rk3399-gmac
- rockchip,rv1108-gmac
- rockchip,rv1126-gmac
- items:
- enum:
- rockchip,rk3568-gmac
- rockchip,rk3588-gmac
- rockchip,rv1126-gmac
- const: snps,dwmac-4.20a
clocks:


@ -552,7 +552,7 @@ required:
dependencies:
snps,reset-active-low: ["snps,reset-gpio"]
snps,reset-delay-us: ["snps,reset-gpio"]
snps,reset-delays-us: ["snps,reset-gpio"]
allOf:
- $ref: "ethernet-controller.yaml#"


@ -57,6 +57,7 @@ properties:
- ti,am654-cpsw-nuss
- ti,j7200-cpswxg-nuss
- ti,j721e-cpsw-nuss
- ti,j721e-cpswxg-nuss
- ti,am642-cpsw-nuss
reg:
@ -111,7 +112,7 @@ properties:
const: 0
patternProperties:
"^port@[1-4]$":
"^port@[1-8]$":
type: object
description: CPSWxG NUSS external ports
@ -121,7 +122,7 @@ properties:
properties:
reg:
minimum: 1
maximum: 4
maximum: 8
description: CPSW port number
phys:
@ -186,12 +187,36 @@ allOf:
properties:
compatible:
contains:
const: ti,j7200-cpswxg-nuss
const: ti,j721e-cpswxg-nuss
then:
properties:
ethernet-ports:
patternProperties:
"^port@[3-4]$": false
"^port@[5-8]$": false
"^port@[1-4]$":
properties:
reg:
minimum: 1
maximum: 4
- if:
not:
properties:
compatible:
contains:
enum:
- ti,j721e-cpswxg-nuss
- ti,j7200-cpswxg-nuss
then:
properties:
ethernet-ports:
patternProperties:
"^port@[3-8]$": false
"^port@[1-2]$":
properties:
reg:
minimum: 1
maximum: 2
additionalProperties: false


@ -93,6 +93,14 @@ properties:
description:
Number of timestamp Generator function outputs (TS_GENFx)
ti,pps:
$ref: /schemas/types.yaml#/definitions/uint32-array
minItems: 2
maxItems: 2
description: |
The pair of HWx_TS_PUSH input and TS_GENFy output indexes used for
PPS events generation. Platform/board specific.
refclk-mux:
type: object
additionalProperties: false


@ -1,6 +1,5 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
# Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/wireless/ieee80211.yaml#


@ -1,4 +1,4 @@
Marvell 8787/8897/8997 (sd8787/sd8897/sd8997/pcie8997) SDIO/PCIE devices
Marvell 8787/8897/8978/8997 (sd8787/sd8897/sd8978/sd8997/pcie8997) SDIO/PCIE devices
------
This node provides properties for controlling the Marvell SDIO/PCIE wireless device.
@ -10,7 +10,9 @@ Required properties:
- compatible : should be one of the following:
* "marvell,sd8787"
* "marvell,sd8897"
* "marvell,sd8978"
* "marvell,sd8997"
* "nxp,iw416"
* "pci11ab,2b42"
* "pci1b4b,2b42"


@ -1,6 +1,5 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
# Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/wireless/mediatek,mt76.yaml#


@ -1,6 +1,5 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
# Copyright (c) 2018-2019 The Linux Foundation. All rights reserved.
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/wireless/qcom,ath11k.yaml#
@ -21,6 +20,7 @@ properties:
- qcom,ipq8074-wifi
- qcom,ipq6018-wifi
- qcom,wcn6750-wifi
- qcom,ipq5018-wifi
reg:
maxItems: 1
@ -262,10 +262,10 @@ allOf:
examples:
- |
q6v5_wcss: q6v5_wcss@CD00000 {
q6v5_wcss: remoteproc@cd00000 {
compatible = "qcom,ipq8074-wcss-pil";
reg = <0xCD00000 0x4040>,
<0x4AB000 0x20>;
reg = <0xcd00000 0x4040>,
<0x4ab000 0x20>;
reg-names = "qdsp6",
"rmb";
};
@ -386,7 +386,7 @@ examples:
#address-cells = <2>;
#size-cells = <2>;
qcn9074_0: qcn9074_0@51100000 {
qcn9074_0: wifi@51100000 {
no-map;
reg = <0x0 0x51100000 0x0 0x03500000>;
};


@ -2,7 +2,6 @@
# Copyright (c) 2020, Silicon Laboratories, Inc.
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/wireless/silabs,wfx.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#


@ -785,6 +785,8 @@ patternProperties:
description: MaxBotix Inc.
"^maxim,.*":
description: Maxim Integrated Products
"^maxlinear,.*":
description: MaxLinear Inc.
"^mbvl,.*":
description: Mobiveil Inc.
"^mcube,.*":
@ -855,6 +857,8 @@ patternProperties:
description: Moortec Semiconductor Ltd.
"^mosaixtech,.*":
description: Mosaix Technologies, Inc.
"^motorcomm,.*":
description: MotorComm, Inc.
"^motorola,.*":
description: Motorola, Inc.
"^moxa,.*":


@ -323,7 +323,7 @@ If the lowest bit of showcapimsgs is set, kernelcapi logs controller and
application up and down events.
In addition, every registered CAPI controller has an associated traceflag
parameter controlling how CAPI messages sent from and to tha controller are
parameter controlling how CAPI messages sent from and to the controller are
logged. The traceflag parameter is initialized with the value of the
showcapimsgs parameter when the controller is registered, but can later be
changed via the MANUFACTURER_REQ command KCAPI_CMD_TRACE.


@ -3,7 +3,7 @@ mISDN Driver
============
mISDN is a new modular ISDN driver, in the long term it should replace
the old I4L driver architecture for passiv ISDN cards.
the old I4L driver architecture for passive ISDN cards.
It was designed to allow a broad range of applications and interfaces
but only have the basic function in kernel, the interface to the user
space is based on sockets with a own address family AF_ISDN.


@ -0,0 +1,331 @@
# SPDX-License-Identifier: GPL-2.0
%YAML 1.2
---
$id: http://kernel.org/schemas/netlink/genetlink-c.yaml#
$schema: https://json-schema.org/draft-07/schema
# Common defines
$defs:
uint:
type: integer
minimum: 0
len-or-define:
type: [ string, integer ]
pattern: ^[0-9A-Za-z_]+( - 1)?$
minimum: 0
# Schema for specs
title: Protocol
description: Specification of a genetlink protocol
type: object
required: [ name, doc, attribute-sets, operations ]
additionalProperties: False
properties:
name:
description: Name of the genetlink family.
type: string
doc:
type: string
version:
description: Generic Netlink family version. Default is 1.
type: integer
minimum: 1
protocol:
description: Schema compatibility level. Default is "genetlink".
enum: [ genetlink, genetlink-c ]
# Start genetlink-c
uapi-header:
description: Path to the uAPI header, default is linux/${family-name}.h
type: string
c-family-name:
description: Name of the define for the family name.
type: string
c-version-name:
description: Name of the define for the version of the family.
type: string
max-by-define:
description: Makes the number of attributes and commands be specified by a define, not an enum value.
type: boolean
# End genetlink-c
definitions:
description: List of type and constant definitions (enums, flags, defines).
type: array
items:
type: object
required: [ type, name ]
additionalProperties: False
properties:
name:
type: string
header:
description: For C-compatible languages, header which already defines this value.
type: string
type:
enum: [ const, enum, flags ]
doc:
type: string
# For const
value:
description: For const - the value.
type: [ string, integer ]
# For enum and flags
value-start:
description: For enum or flags the literal initializer for the first value.
type: [ string, integer ]
entries:
description: For enum or flags array of values.
type: array
items:
oneOf:
- type: string
- type: object
required: [ name ]
additionalProperties: False
properties:
name:
type: string
value:
type: integer
doc:
type: string
render-max:
description: Render the max members for this enum.
type: boolean
# Start genetlink-c
enum-name:
description: Name for enum, if empty no name will be used.
type: [ string, "null" ]
name-prefix:
description: For enum the prefix of the values, optional.
type: string
# End genetlink-c
attribute-sets:
description: Definition of attribute spaces for this family.
type: array
items:
description: Definition of a single attribute space.
type: object
required: [ name, attributes ]
additionalProperties: False
properties:
name:
description: |
Name used when referring to this space in other definitions, not used outside of the spec.
type: string
name-prefix:
description: |
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
type: string
enum-name:
description: Name for the enum type of the attribute.
type: string
doc:
description: Documentation of the space.
type: string
subset-of:
description: |
Name of another space which this is a logical part of. Sub-spaces can be used to define
a limited group of attributes which are used in a nest.
type: string
# Start genetlink-c
attr-cnt-name:
description: The explicit name for constant holding the count of attributes (last attr + 1).
type: string
attr-max-name:
description: The explicit name for last member of attribute enum.
type: string
# End genetlink-c
attributes:
description: List of attributes in the space.
type: array
items:
type: object
required: [ name, type ]
additionalProperties: False
properties:
name:
type: string
type: &attr-type
enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64,
string, nest, array-nest, nest-type-value ]
doc:
description: Documentation of the attribute.
type: string
value:
description: Value for the enum item representing this attribute in the uAPI.
$ref: '#/$defs/uint'
type-value:
description: Name of the value extracted from the type of a nest-type-value attribute.
type: array
items:
type: string
byte-order:
enum: [ little-endian, big-endian ]
multi-attr:
type: boolean
nested-attributes:
description: Name of the space (sub-space) used inside the attribute.
type: string
enum:
description: Name of the enum type used for the attribute.
type: string
enum-as-flags:
description: |
Treat the enum as flags. In most cases enum is either used as flags or as values.
Sometimes, however, both forms are necessary, in which case header contains the enum
form while specific attributes may request to convert the values into a bitfield.
type: boolean
checks:
description: Kernel input validation.
type: object
additionalProperties: False
properties:
flags-mask:
description: Name of the flags constant on which to base mask (unsigned scalar types only).
type: string
min:
description: Min value for an integer attribute.
type: integer
min-len:
description: Min length for a binary attribute.
$ref: '#/$defs/len-or-define'
max-len:
description: Max length for a string or a binary attribute.
$ref: '#/$defs/len-or-define'
sub-type: *attr-type
# Make sure name-prefix does not appear in subsets (subsets inherit naming)
dependencies:
name-prefix:
not:
required: [ subset-of ]
subset-of:
not:
required: [ name-prefix ]
operations:
description: Operations supported by the protocol.
type: object
required: [ list ]
additionalProperties: False
properties:
enum-model:
description: |
The model of assigning values to the operations.
"unified" is the recommended model where all message types belong
to a single enum.
"directional" has the messages sent to the kernel and from the kernel
enumerated separately.
enum: [ unified ]
name-prefix:
description: |
Prefix for the C enum name of the command. The name is formed by concatenating
the prefix with the upper case name of the command, with dashes replaced by underscores.
type: string
enum-name:
description: Name for the enum type with commands.
type: string
async-prefix:
description: Same as name-prefix but used to render notifications and events to separate enum.
type: string
async-enum:
description: Name for the enum type with notifications/events.
type: string
list:
description: List of commands
type: array
items:
type: object
additionalProperties: False
required: [ name, doc ]
properties:
name:
description: Name of the operation, also defining its C enum value in uAPI.
type: string
doc:
description: Documentation for the command.
type: string
value:
description: Value for the enum in the uAPI.
$ref: '#/$defs/uint'
attribute-set:
description: |
Attribute space from which attributes directly in the requests and replies
to this command are defined.
type: string
flags: &cmd_flags
description: Command flags.
type: array
items:
enum: [ admin-perm ]
dont-validate:
description: Kernel attribute validation flags.
type: array
items:
enum: [ strict, dump ]
do: &subop-type
description: Main command handler.
type: object
additionalProperties: False
properties:
request: &subop-attr-list
description: Definition of the request message for a given command.
type: object
additionalProperties: False
properties:
attributes:
description: |
Names of attributes from the attribute-set (not full attribute
definitions, just names).
type: array
items:
type: string
reply: *subop-attr-list
pre:
description: Hook for a function to run before the main callback (pre_doit or start).
type: string
post:
description: Hook for a function to run after the main callback (post_doit or done).
type: string
dump: *subop-type
notify:
description: Name of the command sharing the reply type with this notification.
type: string
event:
type: object
additionalProperties: False
properties:
attributes:
description: Explicit list of the attributes for the notification.
type: array
items:
type: string
mcgrp:
description: Name of the multicast group generating given notification.
type: string
mcast-groups:
description: List of multicast groups.
type: object
required: [ list ]
additionalProperties: False
properties:
list:
description: List of groups.
type: array
items:
type: object
required: [ name ]
additionalProperties: False
properties:
name:
description: |
The name for the group, used to form the define and the value of the define.
type: string
# Start genetlink-c
c-define-name:
description: Override for the name of the define in C uAPI.
type: string
# End genetlink-c
flags: *cmd_flags
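The `name-prefix` description under `operations` says the C enum name is the prefix concatenated with the upper-cased command name, with dashes replaced by underscores. A minimal Python sketch of that rule follows. The helper name `c_enum_name` and the prefix `ETHTOOL_MSG_` are assumptions made for illustration, and the sketch assumes the prefix is already in its final C form; it is not the code generator's implementation.

```python
def c_enum_name(name_prefix: str, op_name: str) -> str:
    """Form a C enum entry name the way the schema description words it."""
    # Upper-case the operation name and replace dashes with underscores,
    # then prepend the prefix unchanged.
    return name_prefix + op_name.upper().replace("-", "_")


# "ETHTOOL_MSG_" is a hypothetical prefix chosen for this example.
print(c_enum_name("ETHTOOL_MSG_", "privflags-get"))
```

With that hypothetical prefix, an operation named `privflags-get` would map to an enum entry named `ETHTOOL_MSG_PRIVFLAGS_GET`.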


@ -0,0 +1,361 @@
# SPDX-License-Identifier: GPL-2.0
%YAML 1.2
---
$id: http://kernel.org/schemas/netlink/genetlink-legacy.yaml#
$schema: https://json-schema.org/draft-07/schema
# Common defines
$defs:
uint:
type: integer
minimum: 0
len-or-define:
type: [ string, integer ]
pattern: ^[0-9A-Za-z_]+( - 1)?$
minimum: 0
# Schema for specs
title: Protocol
description: Specification of a genetlink protocol
type: object
required: [ name, doc, attribute-sets, operations ]
additionalProperties: False
properties:
name:
description: Name of the genetlink family.
type: string
doc:
type: string
version:
description: Generic Netlink family version. Default is 1.
type: integer
minimum: 1
protocol:
description: Schema compatibility level. Default is "genetlink".
enum: [ genetlink, genetlink-c, genetlink-legacy ] # Trim
# Start genetlink-c
uapi-header:
description: Path to the uAPI header, default is linux/${family-name}.h
type: string
c-family-name:
description: Name of the define for the family name.
type: string
c-version-name:
description: Name of the define for the version of the family.
type: string
max-by-define:
description: Makes the number of attributes and commands be specified by a define, not an enum value.
type: boolean
# End genetlink-c
# Start genetlink-legacy
kernel-policy:
description: |
Defines if the input policy in the kernel is global, per-operation, or split per operation type.
Default is split.
enum: [ split, per-op, global ]
# End genetlink-legacy
definitions:
description: List of type and constant definitions (enums, flags, defines).
type: array
items:
type: object
required: [ type, name ]
additionalProperties: False
properties:
name:
type: string
header:
description: For C-compatible languages, header which already defines this value.
type: string
type:
enum: [ const, enum, flags, struct ] # Trim
doc:
type: string
# For const
value:
description: For const - the value.
type: [ string, integer ]
# For enum and flags
value-start:
description: For enum or flags the literal initializer for the first value.
type: [ string, integer ]
entries:
description: For enum or flags array of values.
type: array
items:
oneOf:
- type: string
- type: object
required: [ name ]
additionalProperties: False
properties:
name:
type: string
value:
type: integer
doc:
type: string
render-max:
description: Render the max members for this enum.
type: boolean
# Start genetlink-c
enum-name:
description: Name for enum, if empty no name will be used.
type: [ string, "null" ]
name-prefix:
description: For enum the prefix of the values, optional.
type: string
# End genetlink-c
# Start genetlink-legacy
members:
description: List of struct members. Only scalars and strings members allowed.
type: array
items:
type: object
required: [ name, type ]
additionalProperties: False
properties:
name:
type: string
type:
enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string ]
len:
$ref: '#/$defs/len-or-define'
# End genetlink-legacy
attribute-sets:
description: Definition of attribute spaces for this family.
type: array
items:
description: Definition of a single attribute space.
type: object
required: [ name, attributes ]
additionalProperties: False
properties:
name:
description: |
Name used when referring to this space in other definitions, not used outside of the spec.
type: string
name-prefix:
description: |
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
type: string
enum-name:
description: Name for the enum type of the attribute.
type: string
doc:
description: Documentation of the space.
type: string
subset-of:
description: |
Name of another space which this is a logical part of. Sub-spaces can be used to define
a limited group of attributes which are used in a nest.
type: string
# Start genetlink-c
attr-cnt-name:
description: The explicit name for constant holding the count of attributes (last attr + 1).
type: string
attr-max-name:
description: The explicit name for last member of attribute enum.
type: string
# End genetlink-c
attributes:
description: List of attributes in the space.
type: array
items:
type: object
required: [ name, type ]
additionalProperties: False
properties:
name:
type: string
type: &attr-type
enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64,
string, nest, array-nest, nest-type-value ]
doc:
description: Documentation of the attribute.
type: string
value:
description: Value for the enum item representing this attribute in the uAPI.
$ref: '#/$defs/uint'
type-value:
description: Name of the value extracted from the type of a nest-type-value attribute.
type: array
items:
type: string
byte-order:
enum: [ little-endian, big-endian ]
multi-attr:
type: boolean
nested-attributes:
description: Name of the space (sub-space) used inside the attribute.
type: string
enum:
description: Name of the enum type used for the attribute.
type: string
enum-as-flags:
description: |
Treat the enum as flags. In most cases enum is either used as flags or as values.
Sometimes, however, both forms are necessary, in which case header contains the enum
form while specific attributes may request to convert the values into a bitfield.
type: boolean
checks:
description: Kernel input validation.
type: object
additionalProperties: False
properties:
flags-mask:
description: Name of the flags constant on which to base mask (unsigned scalar types only).
type: string
min:
description: Min value for an integer attribute.
type: integer
min-len:
description: Min length for a binary attribute.
$ref: '#/$defs/len-or-define'
max-len:
description: Max length for a string or a binary attribute.
$ref: '#/$defs/len-or-define'
sub-type: *attr-type
# Make sure name-prefix does not appear in subsets (subsets inherit naming)
dependencies:
name-prefix:
not:
required: [ subset-of ]
subset-of:
not:
required: [ name-prefix ]
operations:
description: Operations supported by the protocol.
type: object
required: [ list ]
additionalProperties: False
properties:
enum-model:
description: |
The model of assigning values to the operations.
"unified" is the recommended model where all message types belong
to a single enum.
"directional" has the messages sent to the kernel and from the kernel
enumerated separately.
enum: [ unified, directional ] # Trim
name-prefix:
description: |
Prefix for the C enum name of the command. The name is formed by concatenating
the prefix with the upper case name of the command, with dashes replaced by underscores.
type: string
enum-name:
description: Name for the enum type with commands.
type: string
async-prefix:
description: Same as name-prefix but used to render notifications and events to separate enum.
type: string
async-enum:
description: Name for the enum type with notifications/events.
type: string
list:
description: List of commands
type: array
items:
type: object
additionalProperties: False
required: [ name, doc ]
properties:
name:
description: Name of the operation, also defining its C enum value in uAPI.
type: string
doc:
description: Documentation for the command.
type: string
value:
description: Value for the enum in the uAPI.
$ref: '#/$defs/uint'
attribute-set:
description: |
Attribute space from which attributes directly in the requests and replies
to this command are defined.
type: string
flags: &cmd_flags
description: Command flags.
type: array
items:
enum: [ admin-perm ]
dont-validate:
description: Kernel attribute validation flags.
type: array
items:
enum: [ strict, dump ]
do: &subop-type
description: Main command handler.
type: object
additionalProperties: False
properties:
request: &subop-attr-list
description: Definition of the request message for a given command.
type: object
additionalProperties: False
properties:
attributes:
description: |
Names of attributes from the attribute-set (not full attribute
definitions, just names).
type: array
items:
type: string
# Start genetlink-legacy
value:
description: |
ID of this message if value for request and response differ,
i.e. requests and responses have different message enums.
$ref: '#/$defs/uint'
# End genetlink-legacy
reply: *subop-attr-list
pre:
description: Hook for a function to run before the main callback (pre_doit or start).
type: string
post:
description: Hook for a function to run after the main callback (post_doit or done).
type: string
dump: *subop-type
notify:
description: Name of the command sharing the reply type with this notification.
type: string
event:
type: object
additionalProperties: False
properties:
attributes:
description: Explicit list of the attributes for the notification.
type: array
items:
type: string
mcgrp:
description: Name of the multicast group generating given notification.
type: string
mcast-groups:
description: List of multicast groups.
type: object
required: [ list ]
additionalProperties: False
properties:
list:
description: List of groups.
type: array
items:
type: object
required: [ name ]
additionalProperties: False
properties:
name:
description: |
The name for the group, used to form the define and the value of the define.
type: string
# Start genetlink-c
c-define-name:
description: Override for the name of the define in C uAPI.
type: string
# End genetlink-c
flags: *cmd_flags
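The `len-or-define` definition in `$defs` accepts either a non-negative integer or a string matching `^[0-9A-Za-z_]+( - 1)?$`, that is, a number or define name optionally followed by a literal ` - 1`. The sketch below mirrors that check in plain Python under JSON Schema's usual reading (`pattern` constrains only strings, `minimum` only numbers). The function name `is_len_or_define` and the sample define `IFNAMSIZ` are assumptions for illustration; this is not the validator the kernel tooling uses.

```python
import re

# Pattern copied from the schema's len-or-define definition.
LEN_OR_DEFINE = re.compile(r"^[0-9A-Za-z_]+( - 1)?$")


def is_len_or_define(value) -> bool:
    """Return True if value would satisfy the len-or-define constraints."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python but is not a schema integer.
        return False
    if isinstance(value, int):
        # "minimum: 0" applies to numeric values.
        return value >= 0
    if isinstance(value, str):
        # "pattern" applies to string values.
        return LEN_OR_DEFINE.fullmatch(value) is not None
    return False


for sample in ("IFNAMSIZ - 1", 16, -1, "FOO-1"):
    print(repr(sample), is_len_or_define(sample))
```

The spaces around the minus sign are part of the pattern, so a string such as `FOO-1` would be rejected while `FOO - 1` would pass.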


@ -0,0 +1,296 @@
# SPDX-License-Identifier: GPL-2.0
%YAML 1.2
---
$id: http://kernel.org/schemas/netlink/genetlink.yaml#
$schema: https://json-schema.org/draft-07/schema
# Common defines
$defs:
uint:
type: integer
minimum: 0
len-or-define:
type: [ string, integer ]
pattern: ^[0-9A-Za-z_]+( - 1)?$
minimum: 0
# Schema for specs
title: Protocol
description: Specification of a genetlink protocol
type: object
required: [ name, doc, attribute-sets, operations ]
additionalProperties: False
properties:
name:
description: Name of the genetlink family.
type: string
doc:
type: string
version:
description: Generic Netlink family version. Default is 1.
type: integer
minimum: 1
protocol:
description: Schema compatibility level. Default is "genetlink".
enum: [ genetlink ]
definitions:
description: List of type and constant definitions (enums, flags, defines).
type: array
items:
type: object
required: [ type, name ]
additionalProperties: False
properties:
name:
type: string
header:
description: For C-compatible languages, header which already defines this value.
type: string
type:
enum: [ const, enum, flags ]
doc:
type: string
# For const
value:
description: For const - the value.
type: [ string, integer ]
# For enum and flags
value-start:
description: For enum or flags the literal initializer for the first value.
type: [ string, integer ]
entries:
description: For enum or flags array of values.
type: array
items:
oneOf:
- type: string
- type: object
required: [ name ]
additionalProperties: False
properties:
name:
type: string
value:
type: integer
doc:
type: string
render-max:
description: Render the max members for this enum.
type: boolean
attribute-sets:
description: Definition of attribute spaces for this family.
type: array
items:
description: Definition of a single attribute space.
type: object
required: [ name, attributes ]
additionalProperties: False
properties:
name:
description: |
Name used when referring to this space in other definitions, not used outside of the spec.
type: string
name-prefix:
description: |
Prefix for the C enum name of the attributes. Default family[name]-set[name]-a-
type: string
enum-name:
description: Name for the enum type of the attribute.
type: string
doc:
description: Documentation of the space.
type: string
subset-of:
description: |
Name of another space which this is a logical part of. Sub-spaces can be used to define
a limited group of attributes which are used in a nest.
type: string
attributes:
description: List of attributes in the space.
type: array
items:
type: object
required: [ name, type ]
additionalProperties: False
properties:
name:
type: string
type: &attr-type
enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64,
string, nest, array-nest, nest-type-value ]
doc:
description: Documentation of the attribute.
type: string
value:
description: Value for the enum item representing this attribute in the uAPI.
$ref: '#/$defs/uint'
type-value:
description: Name of the value extracted from the type of a nest-type-value attribute.
type: array
items:
type: string
byte-order:
enum: [ little-endian, big-endian ]
multi-attr:
type: boolean
nested-attributes:
description: Name of the space (sub-space) used inside the attribute.
type: string
enum:
description: Name of the enum type used for the attribute.
type: string
enum-as-flags:
description: |
Treat the enum as flags. In most cases enum is either used as flags or as values.
Sometimes, however, both forms are necessary, in which case header contains the enum
form while specific attributes may request to convert the values into a bitfield.
type: boolean
checks:
description: Kernel input validation.
type: object
additionalProperties: False
properties:
flags-mask:
description: Name of the flags constant on which to base mask (unsigned scalar types only).
type: string
min:
description: Min value for an integer attribute.
type: integer
min-len:
description: Min length for a binary attribute.
$ref: '#/$defs/len-or-define'
max-len:
description: Max length for a string or a binary attribute.
$ref: '#/$defs/len-or-define'
sub-type: *attr-type
# Make sure name-prefix does not appear in subsets (subsets inherit naming)
dependencies:
name-prefix:
not:
required: [ subset-of ]
subset-of:
not:
required: [ name-prefix ]
operations:
description: Operations supported by the protocol.
type: object
required: [ list ]
additionalProperties: False
properties:
enum-model:
description: |
The model of assigning values to the operations.
"unified" is the recommended model where all message types belong
to a single enum.
"directional" has the messages sent to the kernel and from the kernel
enumerated separately.
enum: [ unified ]
name-prefix:
description: |
Prefix for the C enum name of the command. The name is formed by concatenating
the prefix with the upper case name of the command, with dashes replaced by underscores.
type: string
enum-name:
description: Name for the enum type with commands.
type: string
async-prefix:
description: Same as name-prefix but used to render notifications and events to separate enum.
type: string
async-enum:
description: Name for the enum type with notifications/events.
type: string
list:
description: List of commands
type: array
items:
type: object
additionalProperties: False
required: [ name, doc ]
properties:
name:
description: Name of the operation, also defining its C enum value in uAPI.
type: string
doc:
description: Documentation for the command.
type: string
value:
description: Value for the enum in the uAPI.
$ref: '#/$defs/uint'
attribute-set:
description: |
Attribute space from which attributes directly in the requests and replies
to this command are defined.
type: string
flags: &cmd_flags
description: Command flags.
type: array
items:
enum: [ admin-perm ]
dont-validate:
description: Kernel attribute validation flags.
type: array
items:
enum: [ strict, dump ]
do: &subop-type
description: Main command handler.
type: object
additionalProperties: False
properties:
request: &subop-attr-list
description: Definition of the request message for a given command.
type: object
additionalProperties: False
properties:
attributes:
description: |
Names of attributes from the attribute-set (not full attribute
definitions, just names).
type: array
items:
type: string
reply: *subop-attr-list
pre:
description: Hook for a function to run before the main callback (pre_doit or start).
type: string
post:
description: Hook for a function to run after the main callback (post_doit or done).
type: string
dump: *subop-type
notify:
description: Name of the command sharing the reply type with this notification.
type: string
event:
type: object
additionalProperties: False
properties:
attributes:
description: Explicit list of the attributes for the notification.
type: array
items:
type: string
mcgrp:
description: Name of the multicast group generating given notification.
type: string
mcast-groups:
description: List of multicast groups.
type: object
required: [ list ]
additionalProperties: False
properties:
list:
description: List of groups.
type: array
items:
type: object
required: [ name ]
additionalProperties: False
properties:
name:
description: |
The name for the group, used to form the define and the value of the define.
type: string
flags: *cmd_flags
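The `dependencies` block in the attribute-set definition makes `name-prefix` and `subset-of` mutually exclusive: a subset inherits its naming from the parent space, so it may not declare its own prefix. A rough Python sketch of that rule, applied to an attribute set already loaded as a dict, is shown below. The function name `naming_conflict` and the sample sets are hypothetical; the indentation of the schema is lost in this view, so the placement of the rule at the attribute-set level is inferred from the accompanying comment.

```python
def naming_conflict(attr_set: dict) -> bool:
    """Return True if an attribute set declares both name-prefix and subset-of."""
    # The schema forbids the two keys from appearing together.
    return "name-prefix" in attr_set and "subset-of" in attr_set


# Hypothetical attribute sets used only for illustration.
parent = {"name": "rings", "name-prefix": "EXAMPLE_A_RINGS_", "attributes": []}
subset = {"name": "rings-get", "subset-of": "rings", "attributes": []}
broken = {"name": "bad", "subset-of": "rings", "name-prefix": "X_", "attributes": []}

for s in (parent, subset, broken):
    print(s["name"], "conflict" if naming_conflict(s) else "ok")
```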


@ -0,0 +1,397 @@
name: ethtool
protocol: genetlink-legacy
doc: Partial family for Ethtool Netlink.
attribute-sets:
-
name: header
attributes:
-
name: dev-index
type: u32
value: 1
-
name: dev-name
type: string
-
name: flags
type: u32
-
name: bitset-bit
attributes:
-
name: index
type: u32
value: 1
-
name: name
type: string
-
name: value
type: flag
-
name: bitset-bits
attributes:
-
name: bit
type: nest
nested-attributes: bitset-bit
value: 1
-
name: bitset
attributes:
-
name: nomask
type: flag
value: 1
-
name: size
type: u32
-
name: bits
type: nest
nested-attributes: bitset-bits
-
name: string
attributes:
-
name: index
type: u32
value: 1
-
name: value
type: string
-
name: strings
attributes:
-
name: string
type: nest
value: 1
multi-attr: true
nested-attributes: string
-
name: stringset
attributes:
-
name: id
type: u32
value: 1
-
name: count
type: u32
-
name: strings
type: nest
multi-attr: true
nested-attributes: strings
-
name: stringsets
attributes:
-
name: stringset
type: nest
multi-attr: true
value: 1
nested-attributes: stringset
-
name: strset
attributes:
-
name: header
value: 1
type: nest
nested-attributes: header
-
name: stringsets
type: nest
nested-attributes: stringsets
-
name: counts-only
type: flag
-
name: privflags
attributes:
-
name: header
value: 1
type: nest
nested-attributes: header
-
name: flags
type: nest
nested-attributes: bitset
-
name: rings
attributes:
-
name: header
value: 1
type: nest
nested-attributes: header
-
name: rx-max
type: u32
-
name: rx-mini-max
type: u32
-
name: rx-jumbo-max
type: u32
-
name: tx-max
type: u32
-
name: rx
type: u32
-
name: rx-mini
type: u32
-
name: rx-jumbo
type: u32
-
name: tx
type: u32
-
name: rx-buf-len
type: u32
-
name: tcp-data-split
type: u8
-
name: cqe-size
type: u32
-
name: tx-push
type: u8
-
name: rx-push
type: u8
-
name: mm-stat
attributes:
-
name: pad
value: 1
type: pad
-
name: reassembly-errors
type: u64
-
name: smd-errors
type: u64
-
name: reassembly-ok
type: u64
-
name: rx-frag-count
type: u64
-
name: tx-frag-count
type: u64
-
name: hold-count
type: u64
-
name: mm
attributes:
-
name: header
value: 1
type: nest
nested-attributes: header
-
name: pmac-enabled
type: u8
-
name: tx-enabled
type: u8
-
name: tx-active
type: u8
-
name: tx-min-frag-size
type: u32
-
name: rx-min-frag-size
type: u32
-
name: verify-enabled
type: u8
-
name: verify-status
type: u8
-
name: verify-time
type: u32
-
name: max-verify-time
type: u32
-
name: stats
type: nest
nested-attributes: mm-stat
operations:
enum-model: directional
list:
-
name: strset-get
doc: Get string set from the kernel.
attribute-set: strset
do: &strset-get-op
request:
value: 1
attributes:
- header
- stringsets
- counts-only
reply:
value: 1
attributes:
- header
- stringsets
dump: *strset-get-op
# TODO: fill in the requests in between
-
name: privflags-get
doc: Get device private flags.
attribute-set: privflags
do: &privflag-get-op
request:
value: 13
attributes:
- header
reply:
value: 14
attributes:
- header
- flags
dump: *privflag-get-op
-
name: privflags-set
doc: Set device private flags.
attribute-set: privflags
do:
request:
attributes:
- header
- flags
-
name: privflags-ntf
doc: Notification for change in device private flags.
notify: privflags-get
-
name: rings-get
doc: Get ring params.
attribute-set: rings
do: &ring-get-op
request:
attributes:
- header
reply:
attributes:
- header
- rx-max
- rx-mini-max
- rx-jumbo-max
- tx-max
- rx
- rx-mini
- rx-jumbo
- tx
- rx-buf-len
- tcp-data-split
- cqe-size
- tx-push
- rx-push
dump: *ring-get-op
-
name: rings-set
doc: Set ring params.
attribute-set: rings
do:
request:
attributes:
- header
- rx
- rx-mini
- rx-jumbo
- tx
- rx-buf-len
- tcp-data-split
- cqe-size
- tx-push
- rx-push
-
name: rings-ntf
doc: Notification for change in ring params.
notify: rings-get
# TODO: fill in the requests in between
-
name: mm-get
doc: Get MAC Merge configuration and state
attribute-set: mm
do: &mm-get-op
request:
value: 42
attributes:
- header
reply:
value: 42
attributes:
- header
- pmac-enabled
- tx-enabled
- tx-active
- tx-min-frag-size
- rx-min-frag-size
- verify-enabled
- verify-time
- max-verify-time
- stats
dump: *mm-get-op
-
name: mm-set
doc: Set MAC Merge configuration
attribute-set: mm
do:
request:
attributes:
- header
- verify-enabled
- verify-time
- tx-enabled
- pmac-enabled
- tx-min-frag-size
-
name: mm-ntf
doc: Notification for change in MAC Merge configuration.
notify: mm-get
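
Most entries in the attribute sets above carry no explicit `value`. The following is a minimal sketch of how the IDs could be derived, assuming the rule that an attribute without `value` takes the previous attribute's ID plus one (and 1 for the first attribute). The `assign_ids` helper is hypothetical and is not part of the kernel's YNL tooling.

```python
def assign_ids(attributes):
    """Return {name: id} for a list of attribute dicts taken from a spec.

    Assumption: an entry without an explicit ``value`` gets the previous
    ID plus one, and the first entry defaults to 1.
    """
    ids = {}
    next_id = 1
    for attr in attributes:
        # An explicit value overrides the running counter.
        next_id = attr.get("value", next_id)
        ids[attr["name"]] = next_id
        next_id += 1
    return ids


# The ``header`` attribute set from the ethtool spec, written as plain dicts.
header = [
    {"name": "dev-index", "type": "u32", "value": 1},
    {"name": "dev-name", "type": "string"},
    {"name": "flags", "type": "u32"},
]

header_ids = assign_ids(header)
print(header_ids)
```

Under that assumption, only the first attribute of each set needs `value: 1`, which is the pattern the ethtool spec follows.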

@ -0,0 +1,128 @@
name: fou
protocol: genetlink-legacy
doc: |
Foo-over-UDP.
c-family-name: fou-genl-name
c-version-name: fou-genl-version
max-by-define: true
kernel-policy: global
definitions:
-
type: enum
name: encap_type
name-prefix: fou-encap-
enum-name:
entries: [ unspec, direct, gue ]
attribute-sets:
-
name: fou
name-prefix: fou-attr-
attributes:
-
name: unspec
type: unused
-
name: port
type: u16
byte-order: big-endian
-
name: af
type: u8
-
name: ipproto
type: u8
-
name: type
type: u8
-
name: remcsum_nopartial
type: flag
-
name: local_v4
type: u32
-
name: local_v6
type: binary
checks:
min-len: 16
-
name: peer_v4
type: u32
-
name: peer_v6
type: binary
checks:
min-len: 16
-
name: peer_port
type: u16
byte-order: big-endian
-
name: ifindex
type: s32
operations:
list:
-
name: unspec
doc: unused
-
name: add
doc: Add port.
attribute-set: fou
dont-validate: [ strict, dump ]
flags: [ admin-perm ]
do:
request: &all_attrs
attributes:
- port
- ipproto
- type
- remcsum_nopartial
- local_v4
- peer_v4
- local_v6
- peer_v6
- peer_port
- ifindex
-
name: del
doc: Delete port.
attribute-set: fou
dont-validate: [ strict, dump ]
flags: [ admin-perm ]
do:
request: &select_attrs
attributes:
- af
- ifindex
- port
- peer_port
- local_v4
- peer_v4
- local_v6
- peer_v6
-
name: get
doc: Get tunnel info.
attribute-set: fou
dont-validate: [ strict, dump ]
do:
request: *select_attrs
reply: *all_attrs
dump:
reply: *all_attrs
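
The `port` and `peer_port` attributes are marked `byte-order: big-endian`. As an illustration only, the sketch below encodes one such attribute by hand. It assumes attribute IDs follow the order in the spec (so `unspec` is 0 and `port` is 1); the `put_be16` helper is hypothetical and not taken from any kernel tool.

```python
import struct

NLA_HDRLEN = 4   # struct nlattr: u16 nla_len followed by u16 nla_type
NLA_ALIGNTO = 4  # attributes are padded to a 4-byte boundary

# Assumption: IDs follow the order in the spec, so unspec = 0 and port = 1.
FOU_ATTR_PORT = 1


def put_be16(attr_type, value):
    """Encode one netlink attribute carrying a big-endian u16 payload."""
    payload = struct.pack(">H", value)               # network byte order
    nla_len = NLA_HDRLEN + len(payload)              # length excludes padding
    header = struct.pack("=HH", nla_len, attr_type)  # header is host order
    pad = (-nla_len) % NLA_ALIGNTO
    return header + payload + b"\x00" * pad


attr = put_be16(FOU_ATTR_PORT, 5555)
print(attr.hex())
```

Only the payload is byte-swapped; the attribute header stays in host byte order.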

@ -0,0 +1,100 @@
name: netdev
doc:
netdev configuration over generic netlink.
definitions:
-
type: flags
name: xdp-act
entries:
-
name: basic
doc:
XDP features set supported by all drivers
(XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX)
-
name: redirect
doc:
The netdev supports XDP_REDIRECT
-
name: ndo-xmit
doc:
This feature informs if netdev implements ndo_xdp_xmit callback.
-
name: xsk-zerocopy
doc:
This feature informs if netdev supports AF_XDP in zero copy mode.
-
name: hw-offload
doc:
This feature informs if netdev supports XDP hw offloading.
-
name: rx-sg
doc:
This feature informs if netdev implements non-linear XDP buffer
support in the driver napi callback.
-
name: ndo-xmit-sg
doc:
This feature informs if netdev implements non-linear XDP buffer
support in ndo_xdp_xmit callback.
attribute-sets:
-
name: dev
attributes:
-
name: ifindex
doc: netdev ifindex
type: u32
value: 1
checks:
min: 1
-
name: pad
type: pad
-
name: xdp-features
doc: Bitmask of enabled xdp-features.
type: u64
enum: xdp-act
enum-as-flags: true
operations:
list:
-
name: dev-get
doc: Get / dump information about a netdev.
value: 1
attribute-set: dev
do:
request:
attributes:
- ifindex
reply: &dev-all
attributes:
- ifindex
- xdp-features
dump:
reply: *dev-all
-
name: dev-add-ntf
doc: Notification about device appearing.
notify: dev-get
mcgrp: mgmt
-
name: dev-del-ntf
doc: Notification about device disappearing.
notify: dev-get
mcgrp: mgmt
-
name: dev-change-ntf
doc: Notification about device configuration being changed.
notify: dev-get
mcgrp: mgmt
mcast-groups:
list:
-
name: mgmt
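
The `xdp-features` attribute is a `u64` bitmask over the `xdp-act` flags. The sketch below shows how a consumer might turn such a mask back into names, assuming entry N of a `flags` definition maps to bit `1 << N`. The `decode_xdp_features` helper is hypothetical.

```python
# Entry names copied from the ``xdp-act`` definition, in spec order.
XDP_ACT = [
    "basic", "redirect", "ndo-xmit", "xsk-zerocopy",
    "hw-offload", "rx-sg", "ndo-xmit-sg",
]


def decode_xdp_features(mask):
    """Return the names of the flags set in an xdp-features bitmask.

    Assumption: entry N of a ``flags`` definition maps to bit (1 << N).
    """
    return [name for bit, name in enumerate(XDP_ACT) if mask & (1 << bit)]


names = decode_xdp_features(0b0100011)
print(names)
```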

@ -419,7 +419,7 @@ XDP_UMEM_REG setsockopt
-----------------------
This setsockopt registers a UMEM to a socket. This is the area that
contain all the buffers that packet can recide in. The call takes a
contain all the buffers that packet can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has parameter called chunk_size that is the size that the UMEM is
divided into. It can only be 2K or 4K at the moment. If you have an
@ -592,7 +592,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually
A number of other ways are possible all up to the capabilities of
the NIC you have.
Q: Can I use the XSKMAP to implement a switch betwen different umems
Q: Can I use the XSKMAP to implement a switch between different umems
in copy mode?
A: The short answer is no, that is not supported at the moment. The

@ -1902,7 +1902,7 @@ of 32 possible I/O Base addresses using the following tables::
6 | 10
The I/O address is sum of all switches set to "1". Remember that
the I/O address space bellow 0x200 is RESERVED for mainboard, so
the I/O address space below 0x200 is RESERVED for mainboard, so
switch 1 should be ALWAYS SET TO OFF.

@ -159,7 +159,7 @@ Please send us comments, experiences, questions, anything :)
IRC:
#batadv on ircs://irc.hackint.org/
Mailing-list:
b.a.t.m.a.n@open-mesh.org (optional subscription at
b.a.t.m.a.n@lists.open-mesh.org (optional subscription at
https://lists.open-mesh.org/mailman3/postorius/lists/b.a.t.m.a.n.lists.open-mesh.org/)
You can also contact the Authors:

@ -931,7 +931,7 @@ ival1:
ival2:
Throttle the received message rate down to the value of ival2. This
is useful to reduce messages for the application when the signal inside the
CAN frame is stateless as state changes within the ival2 periode may get
CAN frame is stateless as state changes within the ival2 period may get
lost.
Broadcast Manager Multiplex Message Receive Filter

@ -50,7 +50,7 @@ Setup Packet
``wIndex`` USB Interface Index (0 for device commands)
``wLength`` * Host to Device - Number of bytes to transmit
* Device to Host - Maximum Number of bytes to
receive. If the device send less. Commom ZLP
receive. If the device send less. Common ZLP
semantics are used.
================= =====================================================

@ -93,7 +93,7 @@ MBIM function can be looked up using sysfs. For example::
USB configuration descriptors
-----------------------------
The wMaxControlMessage field of the CDC MBIM functional descriptor
limits the maximum control message size. The managament application is
limits the maximum control message size. The management application is
responsible for negotiating a control message size complying with the
requirements in section 9.3.1 of [1], taking this descriptor field
into consideration.

@ -4,7 +4,7 @@
ATM (i)Chip IA Linux Driver Source
==================================
READ ME FISRT
READ ME FIRST
--------------------------------------------------------------------------------

@ -577,7 +577,7 @@ CTU CAN FD IP Core and Driver Development Acknowledgment
* Linux driver development
* continuous integration platform architect and GHDL updates
* theses `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
* thesis `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
* Jiri Novak <jnovak@fel.cvut.cz>
@ -603,7 +603,7 @@ CTU CAN FD IP Core and Driver Development Acknowledgment
* Jan Charvat
* implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_)
* Bachelor theses Model of CAN FD Communication Controller for QEMU Emulator
* Bachelor thesis Model of CAN FD Communication Controller for QEMU Emulator
Notes
-----

@ -129,10 +129,10 @@
</g>
</g>
<text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text>
<text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccesfull</tspan></text>
<text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccessful</tspan></text>
<g font-size="12px" stroke-width="3.77953" text-anchor="middle">
<text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text>
<text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">succesfull</tspan></text>
<text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">successful</tspan></text>
<text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">sborted</tspan></text>
</g>
<g stroke-width="3.77953" text-anchor="middle">

@ -254,7 +254,7 @@ Media selection
A number of the older NICs such as the 3c590 and 3c900 series have
10base2 and AUI interfaces.
Prior to January, 2001 this driver would autoeselect the 10base2 or AUI
Prior to January, 2001 this driver would autoselect the 10base2 or AUI
port if it didn't detect activity on the 10baseT port. It would then
get stuck on the 10base2 port and a driver reload was necessary to
switch back to 10baseT. This behaviour could not be prevented with a

@ -270,7 +270,7 @@ RX flow rules (ntuple filters)
ethtool -K ethX ntuple <on|off>
When disabling ntuple filters, all the user programed filters are
When disabling ntuple filters, all the user programmed filters are
flushed from the driver cache and hardware. All needed filters must
be re-added when ntuple is re-enabled.
@ -418,7 +418,7 @@ Default value: 0xFFFF
0 Disable interrupt throttling.
1 Enable interrupt throttling and use specified tx and rx rates.
0xFFFF Auto throttling mode. Driver will choose the best RX and TX
interrupt throtting settings based on link speed.
interrupt throttling settings based on link speed.
====== ==============================================================
aq_itr_tx - TX interrupt throttle rate
@ -456,7 +456,7 @@ AQ_CFG_RX_PAGEORDER
Default value: 0
RX page order override. Thats a power of 2 number of RX pages allocated for
RX page order override. That's a power of 2 number of RX pages allocated for
each descriptor. Received descriptor size is still limited by
AQ_CFG_RX_FRAME_MAX.

@ -11,7 +11,7 @@ Overview
--------
The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network
drivers (dpaa2-eth, dpaa2-ethsw) interract with the PHY library.
drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library.
DPAA2 Software Architecture
---------------------------

@ -39,7 +39,7 @@ Contents:
intel/ice
marvell/octeontx2
marvell/octeon_ep
mellanox/mlx5
mellanox/mlx5/index
microsoft/netvsc
neterion/s2io
netronome/nfp

@ -901,15 +901,17 @@ To enable/disable UDP Segmentation Offload, issue the following command::
# ethtool -K <ethX> tx-udp-segmentation [off|on]
GNSS module
-----------
Allows user to read messages from the GNSS module and write supported commands.
If the module is physically present, driver creates 2 TTYs for each supported
device in /dev, ttyGNSS_<device>:<function>_0 and _1. First one (_0) is RW and
the second one is RO.
The protocol of write commands is dependent on the GNSS module as the driver
writes raw bytes from the TTY to the GNSS i2c. Please refer to the module
documentation for details.
Requires kernel compiled with CONFIG_GNSS=y or CONFIG_GNSS=m.
Allows user to read messages from the GNSS hardware module and write supported
commands. If the module is physically present, a GNSS device is spawned:
``/dev/gnss<id>``.
The protocol of write command is dependent on the GNSS hardware module as the
driver writes raw bytes by the GNSS object to the receiver through i2c. Please
refer to the hardware GNSS module documentation for configuration details.
Performance Optimization
========================

@ -1,746 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
=================================================
Mellanox ConnectX(R) mlx5 core VPI Network Driver
=================================================
Copyright (c) 2019, Mellanox Technologies LTD.
Contents
========
- `Enabling the driver and kconfig options`_
- `Devlink info`_
- `Devlink parameters`_
- `Bridge offload`_
- `mlx5 subfunction`_
- `mlx5 function attributes`_
- `Devlink health reporters`_
- `mlx5 tracepoints`_
Enabling the driver and kconfig options
=======================================
| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
| at build time via kernel Kconfig flags.
| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
| For the list of advanced features, please see below.
**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
**CONFIG_MLX5_CORE_EN=(y/n)**
| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
| built-in into mlx5_core.ko.
**CONFIG_MLX5_EN_ARFS=(y/n)**
| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
**CONFIG_MLX5_EN_RXNFC=(y/n)**
| Enables ethtool receive network flow classification, which allows user defined
| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
**CONFIG_MLX5_MPFS=(y/n)**
| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
| MPFS is required when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
| user configured unicast MAC addresses to the requesting PF.
**CONFIG_MLX5_ESWITCH=(y/n)**
| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
| and switching for the enabled VFs and PF in two available modes:
| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
**CONFIG_MLX5_CORE_IPOIB=(y/n)**
| IPoIB offloads & acceleration support.
| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
| IPoIB ulp netdevice.
**CONFIG_MLX5_FPGA=(y/n)**
| Build support for the Innova family of network cards by Mellanox Technologies.
| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
| building sandbox-specific client drivers.
**CONFIG_MLX5_EN_IPSEC=(y/n)**
| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
**CONFIG_MLX5_EN_TLS=(y/n)**
| TLS cryptography-offload acceleration.
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
**CONFIG_MLX5_SF=(y/n)**
| Build support for subfunction.
| Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option
| will enable support for creating subfunction devices.
**External options** ( Choose if the corresponding mlx5 feature is required )
- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
Devlink info
============
The devlink info reports the running and stored firmware versions on device.
It also prints the device PSID which represents the HCA board type ID.
User command example::
$ devlink dev info pci/0000:00:06.0
pci/0000:00:06.0:
driver mlx5_core
versions:
fixed:
fw.psid MT_0000000009
running:
fw.version 16.26.0100
stored:
fw.version 16.26.0100
Devlink parameters
==================
flow_steering_mode: Device flow steering mode
---------------------------------------------
The flow steering mode parameter controls the flow steering mode of the driver.
Two modes are supported:
1. 'dmfs' - Device managed flow steering.
2. 'smfs' - Software/Driver managed flow steering.
In DMFS mode, the HW steering entities are created and managed through the
Firmware.
In SMFS mode, the HW steering entities are created and managed by
the driver directly in hardware, without firmware intervention.
SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode.
User command examples:
- Set SMFS flow steering mode::
$ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
- Read device flow steering mode::
$ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
pci/0000:06:00.0:
name flow_steering_mode type driver-specific
values:
cmode runtime value smfs
enable_roce: RoCE enablement state
----------------------------------
RoCE enablement state controls driver support for RoCE traffic.
When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well-known UDP RoCE port is handled as raw ethernet traffic.
To change RoCE enablement state, a user must change the driverinit cmode value and run devlink reload.
User command examples:
- Disable RoCE::
$ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
- Read RoCE enablement state::
$ devlink dev param show pci/0000:06:00.0 name enable_roce
pci/0000:06:00.0:
name enable_roce type generic
values:
cmode driverinit value true
esw_port_metadata: Eswitch port metadata state
----------------------------------------------
When applicable, disabling eswitch metadata can increase packet rate
up to 20% depending on the use case and packet sizes.
Eswitch port metadata state controls whether to internally tag packets with
metadata. Metadata tagging must be enabled for multi-port RoCE, failover
between representors and stacked devices.
By default metadata is enabled on the supported devices in E-switch.
Metadata is applicable only for E-switch in switchdev mode and
users may disable it when NONE of the below use cases will be in use:
1. HCA is in Dual/multi-port RoCE mode.
2. VF/SF representor bonding (Usually used for Live migration)
3. Stacked devices
When metadata is disabled, the above use cases will fail to initialize if
users try to enable them.
- Show eswitch port metadata::
$ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
pci/0000:06:00.0:
name esw_port_metadata type driver-specific
values:
cmode runtime value true
- Disable eswitch port metadata::
$ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime
- Change eswitch mode to switchdev mode after choosing the metadata value::
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
Bridge offload
==============
The mlx5 driver implements support for offloading bridge rules when in switchdev
mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev
representor is attached to bridge.
- Change device to switchdev mode::
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
$ ip link set enp8s0f0 master bridge1
VLANs
-----
Following bridge VLAN functions are supported by mlx5:
- VLAN filtering (including multiple VLANs per port)::
$ ip link set bridge1 type bridge vlan_filtering 1
$ bridge vlan add dev enp8s0f0 vid 2-3
- VLAN push on bridge ingress::
$ bridge vlan add dev enp8s0f0 vid 3 pvid
- VLAN pop on bridge egress::
$ bridge vlan add dev enp8s0f0 vid 3 untagged
mlx5 subfunction
================
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
A subfunction has its own function capabilities and its own resources. This
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
queues are neither shared nor stolen from the parent PCI function.
When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
resources neither shared nor stolen from the parent PCI function.
A subfunction has a dedicated window in PCI BAR space that is not shared
with the other subfunctions or the parent PCI function. This ensures that all
devices (netdev, rdma, vdpa, etc.) of the subfunction access only the assigned
PCI BAR space.
A subfunction supports eswitch representation through which it supports tc
offloads. The user configures eswitch to send/receive packets from/to
the subfunction port.
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
other subfunctions and/or with its parent PCI function.
Example mlx5 software, system, and device view::
_______
| admin |
| user |----------
|_______| |
| |
____|____ __|______ _________________
| | | | | |
| devlink | | tc tool | | user |
| tool | |_________| | applications |
|_________| | |_________________|
| | | |
| | | | Userspace
+---------|-------------|-------------------|----------|--------------------+
| | +----------+ +----------+ Kernel
| | | netdev | | rdma dev |
| | +----------+ +----------+
(devlink port add/del | ^ ^
port function set) | | |
| | +---------------|
_____|___ | | _______|_______
| | | | | mlx5 class |
| devlink | +------------+ | | drivers |
| kernel | | rep netdev | | |(mlx5_core,ib) |
|_________| +------------+ | |_______________|
| | | ^
(devlink ops) | | (probe/remove)
_________|________ | | ____|________
| subfunction | | +---------------+ | subfunction |
| management driver|----- | subfunction |---| driver |
| (mlx5_core) | | auxiliary dev | | (mlx5_core) |
|__________________| +---------------+ |_____________|
| ^
(sf add/del, vhca events) |
| (device add/del)
_____|____ ____|________
| | | subfunction |
| PCI NIC |--- activate/deactivate events--->| host driver |
|__________| | (mlx5_core) |
|_____________|
Subfunction is created using devlink port interface.
- Change device to switchdev mode::
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
- Add a devlink port of subfunction flavour::
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
- Show a devlink port of the subfunction::
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
- Delete a devlink port of subfunction after use::
$ devlink port del pci/0000:06:00.0/32768
mlx5 function attributes
========================
The mlx5 driver provides a mechanism to setup PCI VF/SF function attributes in
a unified way for SmartNIC and non-SmartNIC.
This is supported only when the eswitch mode is set to switchdev. Port function
configuration of the PCI VF/SF is supported through devlink eswitch port.
Port function attributes should be set before PCI VF/SF is enumerated by the
driver.
MAC address setup
-----------------
The mlx5 driver supports the devlink port function attr mechanism to set up the
MAC address. (refer to Documentation/networking/devlink/devlink-port.rst)
RoCE capability setup
---------------------
Not all mlx5 PCI devices/SFs require RoCE capability.
When RoCE capability is disabled, it saves 1 Mbyte worth of system memory per
PCI device/SF.
The mlx5 driver supports the devlink port function attr mechanism to set up the
RoCE capability. (refer to Documentation/networking/devlink/devlink-port.rst)
migratable capability setup
---------------------------
Users who want mlx5 PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.
The mlx5 driver supports the devlink port function attr mechanism to set up the
migratable capability. (refer to Documentation/networking/devlink/devlink-port.rst)
SF state setup
--------------
To use the SF, the user must activate the SF using the SF function state
attribute.
- Get the state of the SF identified by its unique devlink port index::
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state inactive opstate detached
- Activate the function and verify its state is active::
$ devlink port function set ens2f0npf0sf88 state active
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state active opstate detached
Upon function activation, the PF driver instance gets the event from the device
that a particular SF was activated. It's the cue to put the device on bus, probe
it and instantiate the devlink instance and class specific auxiliary devices
for it.
- Show the auxiliary device and port of the subfunction::
$ devlink dev show
devlink dev show auxiliary/mlx5_core.sf.4
$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
$ rdma link show mlx5_0/1
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
$ rdma dev show
8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
- Subfunction auxiliary device and class device hierarchy::
mlx5_core.sf.4
(subfunction auxiliary device)
/\
/ \
/ \
/ \
/ \
mlx5_core.eth.4 mlx5_core.rdma.4
(sf eth aux dev) (sf rdma aux dev)
| |
| |
p0sf88 mlx5_0
(sf netdev) (sf rdma device)
Additionally, the SF port also gets the event when the driver attaches to the
auxiliary device of the subfunction. This results in changing the operational
state of the function. This provides visibility to the user to decide when it is
safe to delete the SF port for graceful termination of the subfunction.
- Show the SF port operational state::
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state active opstate attached
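
The `function:` line in the output above is a flat run of key/value pairs. As a small sketch, a script that watches for the `opstate` transition could split it as follows. The `parse_function_line` helper is hypothetical, and the sample string is copied from the output shown above rather than read from a live device.

```python
def parse_function_line(line):
    """Split a 'key value key value ...' line into a dict."""
    tokens = line.split()
    # Even positions are keys, odd positions are their values.
    return dict(zip(tokens[0::2], tokens[1::2]))


sample = "hw_addr 00:00:00:00:88:88 state active opstate attached"
fn = parse_function_line(sample)
print(fn["state"], fn["opstate"])
```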
Devlink health reporters
========================
tx reporter
-----------
The tx reporter is responsible for reporting and recovering of the following two error scenarios:
- tx timeout
Report on kernel tx timeout detection.
Recover by searching lost interrupts.
- tx error completion
Report on error tx completion.
Recover by flushing the tx queue and resetting it.
The tx reporter also supports an on-demand diagnose callback, through which it
provides real-time information on the status of its send queues.
User commands examples:
- Diagnose send queues status::
$ devlink health diagnose pci/0000:82:00.0 reporter tx
NOTE: This command has valid output only when the interface is up; otherwise the command has empty output.
- Show the number of tx errors indicated, the number of recover flows that ended successfully,
whether autorecover is enabled, and the graceful period since the last recover::
$ devlink health show pci/0000:82:00.0 reporter tx
rx reporter
-----------
The rx reporter is responsible for reporting and recovering from the following two error scenarios:
- rx queues' initialization (population) timeout
Population of rx queues' descriptors on ring initialization is done
in napi context via triggering an irq. In case of a failure to get
the minimum amount of descriptors, a timeout would occur, and
descriptors could be recovered by polling the EQ (Event Queue).
- rx completions with errors (reported by HW on interrupt context)
Report on rx completion error.
Recover (if needed) by flushing the related queue and resetting it.
The rx reporter also supports an on-demand diagnose callback, which
provides real-time information about the status of its receive queues.
- Diagnose rx queues' status and corresponding completion queue::
$ devlink health diagnose pci/0000:82:00.0 reporter rx
NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
- Show the number of rx errors indicated, the number of recover flows that ended successfully,
whether autorecover is enabled, and the graceful period since the last recover::
$ devlink health show pci/0000:82:00.0 reporter rx
fw reporter
-----------
The fw reporter implements `diagnose` and `dump` callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
current fw status.
User commands examples:
- Check fw health status::
$ devlink health diagnose pci/0000:82:00.0 reporter fw
- Read FW core dump if already stored or trigger new one::
$ devlink health dump show pci/0000:82:00.0 reporter fw
NOTE: This command can run only on the PF which has fw tracer ownership;
running it on another PF or on any VF will return "Operation not permitted".
fw fatal reporter
-----------------
The fw fatal reporter implements `dump` and `recover` callbacks.
It responds to fatal error indications with a CR-space dump and a recover flow.
The CR-space dump uses vsc interface which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors.
The recover function runs recover flow which reloads the driver and triggers fw
reset if needed.
On firmware error, the health buffer is dumped into the dmesg. The log
level is derived from the error's severity (given in health buffer).
User commands examples:
- Run fw recover flow manually::
$ devlink health recover pci/0000:82:00.0 reporter fw_fatal
- Read FW CR-space dump if already stored or trigger new one::
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
NOTE: This command can run only on PF.
mlx5 tracepoints
================
mlx5 driver provides internal tracepoints for tracking and debugging using
kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
For the list of supported mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`.
tc and eswitch offloads tracepoints:
- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
$ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
$ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
- mlx5e_stats_flower: trace flower stats request::
$ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
$ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
$ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
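The trace lines above share a common layout: task and PID, CPU, flags, timestamp, event name, then a payload. A minimal Python sketch of splitting one such line into fields follows. The function name `parse_trace_line` and the regular expression are illustrative assumptions, not part of the mlx5 driver or of any ftrace tooling. Payload items written as `key: value` (as in the neigh events) are kept only in the raw payload string.

```python
import re

# Assumed layout: "<task>-<pid> [<cpu>] <flags> <timestamp>: <event>: <payload>"
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<ts>[\d.]+):\s+(?P<event>\w+):\s*(?P<payload>.*)$"
)


def parse_trace_line(line):
    """Return a dict describing one trace text line, or None if it does not match."""
    m = TRACE_LINE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["cpu"] = int(rec["cpu"])
    # Collect key=value tokens; a trailing comma after a value is dropped.
    rec["fields"] = dict(re.findall(r"(\w+)=\s*([^\s,]+)", rec["payload"]))
    return rec


if __name__ == "__main__":
    sample = ("tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: "
              "cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217")
    print(parse_trace_line(sample)["fields"])
```

Values are left as strings so that cookies and pointers such as `group=000000007b576bb3` are not misread as decimal numbers.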
Bridge offloads tracepoints:
- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
$ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
$ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
mlx5::
$ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
representor::
$ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
representor::
$ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
device::
$ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
device::
$ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
Eswitch QoS tracepoints:
- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
$ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
$ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
$ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
$ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
$ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
$ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
SF tracepoints:
- mlx5_sf_add: trace addition of the SF port::
$ echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
- mlx5_sf_free: trace freeing of the SF port::
$ echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
$ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
$ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
$ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
- mlx5_sf_vhca_event: trace SF vhca event and state::
$ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
- mlx5_sf_dev_add: trace SF device add event::
$ echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
- mlx5_sf_dev_del: trace SF device delete event::
$ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88

@ -0,0 +1,224 @@
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
.. include:: <isonum.txt>
=======
Devlink
=======
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contents
========
- `Info`_
- `Parameters`_
- `Health reporters`_
Info
====
The devlink info command reports the running and stored firmware versions on the device.
It also prints the device PSID, which represents the HCA board type ID.
User command example::
$ devlink dev info pci/0000:00:06.0
pci/0000:00:06.0:
driver mlx5_core
versions:
fixed:
fw.psid MT_0000000009
running:
fw.version 16.26.0100
stored:
fw.version 16.26.0100
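For scripts that compare the running and stored firmware versions, the text output above can be split into sections. The following Python sketch assumes the plain-text layout shown in the example; `parse_dev_info` is a hypothetical helper name, and it does not rely on indentation. devlink also offers JSON output through its `-j` option, which is likely the more robust choice in real tooling.

```python
def parse_dev_info(text):
    """Parse `devlink dev info` text output into a nested dict (sketch)."""
    info = {"versions": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            name = line[:-1]
            if name in ("fixed", "running", "stored"):
                # Start of a version group.
                section = name
                info["versions"][section] = {}
            elif name != "versions":
                # Anything else ending in ":" is taken as the device handle.
                info["device"] = name
            continue
        key, _, value = line.partition(" ")
        if section is None:
            info[key] = value.strip()          # e.g. "driver mlx5_core"
        else:
            info["versions"][section][key] = value.strip()
    return info


def firmware_pending(info):
    """True when the stored firmware differs from the running one."""
    v = info["versions"]
    return v.get("running", {}).get("fw.version") != v.get("stored", {}).get("fw.version")
```

With the example output above, `firmware_pending` returns False because both versions read 16.26.0100.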
Parameters
==========
flow_steering_mode: Device flow steering mode
---------------------------------------------
The flow steering mode parameter controls the flow steering mode of the driver.
Two modes are supported:
1. 'dmfs' - Device managed flow steering.
2. 'smfs' - Software/Driver managed flow steering.
In DMFS mode, the HW steering entities are created and managed through the
Firmware.
In SMFS mode, the HW steering entities are created and managed by
the driver directly in hardware, without firmware intervention.
SMFS mode is faster and provides a better rule insertion rate than the default DMFS mode.
User command examples:
- Set SMFS flow steering mode::
$ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
- Read device flow steering mode::
$ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
pci/0000:06:00.0:
name flow_steering_mode type driver-specific
values:
cmode runtime value smfs
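To confirm that a parameter took effect, a script can read back the `cmode ... value ...` lines of the show output. The Python sketch below assumes the text layout shown above; `param_values` is a hypothetical helper, not a devlink API.

```python
def param_values(show_output):
    """Map each cmode to its value from `devlink dev param show` text output."""
    values = {}
    for line in show_output.splitlines():
        tokens = line.split()
        # Value lines look like: cmode <mode> value <value>
        if len(tokens) >= 4 and tokens[0] == "cmode" and tokens[2] == "value":
            values[tokens[1]] = tokens[3]
    return values


if __name__ == "__main__":
    sample = """pci/0000:06:00.0:
    name flow_steering_mode type driver-specific
    values:
    cmode runtime value smfs
    """
    print(param_values(sample).get("runtime"))
```

The same helper applies to other parameters, since it keys only on the `cmode` and `value` tokens.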
enable_roce: RoCE enablement state
----------------------------------
If the device supports RoCE disablement, RoCE enablement state controls device
support for RoCE capability. Otherwise, the control occurs in the driver stack.
When RoCE is disabled at the driver level, only raw ethernet QPs are supported.
To change RoCE enablement state, a user must change the driverinit cmode value
and run devlink reload.
User command examples:
- Disable RoCE::
$ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
- Read RoCE enablement state::
$ devlink dev param show pci/0000:06:00.0 name enable_roce
pci/0000:06:00.0:
name enable_roce type generic
values:
cmode driverinit value true
esw_port_metadata: Eswitch port metadata state
----------------------------------------------
When applicable, disabling eswitch metadata can increase packet rate
up to 20% depending on the use case and packet sizes.
Eswitch port metadata state controls whether to internally tag packets with
metadata. Metadata tagging must be enabled for multi-port RoCE, failover
between representors and stacked devices.
By default metadata is enabled on the supported devices in E-switch.
Metadata is applicable only for E-switch in switchdev mode and
users may disable it when NONE of the below use cases will be in use:
1. HCA is in Dual/multi-port RoCE mode.
2. VF/SF representor bonding (Usually used for Live migration)
3. Stacked devices
When metadata is disabled, the above use cases will fail to initialize if
users try to enable them.
- Show eswitch port metadata::
$ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
pci/0000:06:00.0:
name esw_port_metadata type driver-specific
values:
cmode runtime value true
- Disable eswitch port metadata::
$ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime
- Change eswitch mode to switchdev mode after choosing the metadata value::
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
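The rule above (metadata may be disabled only when none of the three use cases is in use) can be written as a small pre-check. This is a sketch only: the function and argument names are hypothetical, and the caller is assumed to know which use cases the deployment needs, since the driver does not expose them in this form.

```python
def metadata_blockers(multiport_roce=False, rep_bonding=False, stacked_devices=False):
    """Return the use cases that require eswitch port metadata to stay enabled."""
    blockers = []
    if multiport_roce:
        blockers.append("dual/multi-port RoCE")
    if rep_bonding:
        blockers.append("VF/SF representor bonding")
    if stacked_devices:
        blockers.append("stacked devices")
    return blockers


def can_disable_metadata(**use_cases):
    """True only when none of the listed use cases is planned."""
    return not metadata_blockers(**use_cases)
```

An empty blocker list means `esw_port_metadata` can be set to false before switching to switchdev mode.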
Health reporters
================
tx reporter
-----------
The tx reporter is responsible for reporting and recovering from the following two error scenarios:
- tx timeout
Report on kernel tx timeout detection.
Recover by searching for lost interrupts.
- tx error completion
Report on error tx completion.
Recover by flushing the tx queue and resetting it.
The tx reporter also supports an on-demand diagnose callback, which provides
real-time information about the status of its send queues.
User commands examples:
- Diagnose send queues status::
$ devlink health diagnose pci/0000:82:00.0 reporter tx
NOTE: This command has valid output only when the interface is up; otherwise the command has empty output.
- Show the number of tx errors indicated, the number of recover flows that ended successfully,
whether autorecover is enabled, and the graceful period since the last recover::
$ devlink health show pci/0000:82:00.0 reporter tx
rx reporter
-----------
The rx reporter is responsible for reporting and recovering from the following two error scenarios:
- rx queues' initialization (population) timeout
Population of rx queues' descriptors on ring initialization is done
in napi context via triggering an irq. In case of a failure to get
the minimum amount of descriptors, a timeout would occur, and
descriptors could be recovered by polling the EQ (Event Queue).
- rx completions with errors (reported by HW on interrupt context)
Report on rx completion error.
Recover (if needed) by flushing the related queue and resetting it.
The rx reporter also supports an on-demand diagnose callback, which
provides real-time information about the status of its receive queues.
- Diagnose rx queues' status and corresponding completion queue::
$ devlink health diagnose pci/0000:82:00.0 reporter rx
NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
- Show the number of rx errors indicated, the number of recover flows that ended successfully,
whether autorecover is enabled, and the graceful period since the last recover::
$ devlink health show pci/0000:82:00.0 reporter rx
fw reporter
-----------
The fw reporter implements `diagnose` and `dump` callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
current fw status.
User commands examples:
- Check fw health status::
$ devlink health diagnose pci/0000:82:00.0 reporter fw
- Read FW core dump if already stored or trigger new one::
$ devlink health dump show pci/0000:82:00.0 reporter fw
NOTE: This command can run only on the PF which has fw tracer ownership;
running it on another PF or on any VF will return "Operation not permitted".
fw fatal reporter
-----------------
The fw fatal reporter implements `dump` and `recover` callbacks.
It responds to fatal error indications with a CR-space dump and a recover flow.
The CR-space dump uses vsc interface which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors.
The recover function runs recover flow which reloads the driver and triggers fw
reset if needed.
On firmware error, the health buffer is dumped into the dmesg. The log
level is derived from the error's severity (given in health buffer).
User commands examples:
- Run fw recover flow manually::
$ devlink health recover pci/0000:82:00.0 reporter fw_fatal
- Read FW CR-space dump if already stored or trigger new one::
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
NOTE: This command can run only on PF.
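The commands in this section all follow the pattern `devlink health <action> <device> reporter <name>`. A small Python sketch that builds the argument list and rejects combinations not shown in this document is given below. The table `REPORTER_ACTIONS` is an assumption drawn only from the examples above and may be narrower than what the driver accepts. The helper only builds the command and does not run it.

```python
# Reporter -> actions demonstrated in this document (assumed, not exhaustive).
REPORTER_ACTIONS = {
    "tx": {"diagnose", "show"},
    "rx": {"diagnose", "show"},
    "fw": {"diagnose", "dump show"},
    "fw_fatal": {"recover", "dump show"},
}


def health_cmd(dev, reporter, action):
    """Build the argv for a devlink health command, e.g. for subprocess.run()."""
    allowed = REPORTER_ACTIONS.get(reporter)
    if allowed is None:
        raise ValueError(f"unknown reporter: {reporter}")
    if action not in allowed:
        raise ValueError(f"action {action!r} is not listed for reporter {reporter!r}")
    # "dump show" expands to two argv words.
    return ["devlink", "health", *action.split(), dev, "reporter", reporter]


if __name__ == "__main__":
    print(" ".join(health_cmd("pci/0000:82:00.0", "fw_fatal", "recover")))
```

Returning a list instead of a shell string avoids quoting issues when the result is passed to `subprocess.run`.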


@ -0,0 +1,26 @@
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
.. include:: <isonum.txt>
Mellanox ConnectX(R) mlx5 core VPI Network Driver
=================================================
:Copyright: |copy| 2019, Mellanox Technologies LTD.
:Copyright: |copy| 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contents:
.. toctree::
:maxdepth: 2
kconfig
devlink
switchdev
tracepoints
counters
.. only:: subproject and html
Indices
=======
* :ref:`genindex`


@ -0,0 +1,168 @@
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
.. include:: <isonum.txt>
=======================================
Enabling the driver and kconfig options
=======================================
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
| at build time via kernel Kconfig flags.
| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
| For the list of advanced features, please see below.
**CONFIG_MLX5_BRIDGE=(y/n)**
| Enable :ref:`Ethernet Bridging (BRIDGE) offloading support <mlx5_bridge_offload>`.
| This will provide the ability to add representors of mlx5 uplink and VF
| ports to Bridge and offloading rules for traffic between such ports.
| Supports VLANs (trunk and access modes).
**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
**CONFIG_MLX5_CORE_EN=(y/n)**
| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
| mlx5e is the mlx5 ulp driver which provides the netdevice kernel interface. When chosen, mlx5e is
| built into mlx5_core.ko.
**CONFIG_MLX5_CORE_EN_DCB=(y/n)**
| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
**CONFIG_MLX5_CORE_IPOIB=(y/n)**
| IPoIB offloads & acceleration support.
| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
| IPoIB ulp netdevice.
**CONFIG_MLX5_CLS_ACT=(y/n)**
| Enables offload support for TC classifier action (NET_CLS_ACT).
| Works in both native NIC mode and Switchdev SRIOV mode.
| Flow-based classifiers, such as those registered through
| `tc-flower(8)`, are processed by the device, rather than the
| host. Actions that would then overwrite matching classification
| results would then be instant due to the offload.
**CONFIG_MLX5_EN_ARFS=(y/n)**
| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
**CONFIG_MLX5_EN_IPSEC=(y/n)**
| Enables `IPSec XFRM cryptography-offload acceleration <https://support.mellanox.com/s/article/ConnectX-6DX-Bluefield-2-IPsec-HW-Full-Offload-Configuration-Guide>`_.
**CONFIG_MLX5_EN_MACSEC=(y/n)**
| Build support for MACsec cryptography-offload acceleration in the NIC.
**CONFIG_MLX5_EN_RXNFC=(y/n)**
| Enables ethtool receive network flow classification, which allows user defined
| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
**CONFIG_MLX5_EN_TLS=(y/n)**
| TLS cryptography-offload acceleration.
**CONFIG_MLX5_ESWITCH=(y/n)**
| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
| and switching for the enabled VFs and PF in two available modes:
| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
**CONFIG_MLX5_FPGA=(y/n)**
| Build support for the Innova family of network cards by Mellanox Technologies.
| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
| building sandbox-specific client drivers.
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
**CONFIG_MLX5_MPFS=(y/n)**
| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
| MPFS is required when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled, to allow passing
| user configured unicast MAC addresses to the requesting PF.
**CONFIG_MLX5_SF=(y/n)**
| Build support for subfunction.
| Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option
| will enable support for creating subfunction devices.
**CONFIG_MLX5_SF_MANAGER=(y/n)**
| Build support for subfunction port in the NIC. A Mellanox subfunction
| port is managed through devlink. A subfunction supports RDMA, netdevice
| and vdpa device. It is similar to a SRIOV VF but it doesn't require
| SRIOV support.
**CONFIG_MLX5_SW_STEERING=(y/n)**
| Build support for software-managed steering in the NIC.
**CONFIG_MLX5_TC_CT=(y/n)**
| Support offloading connection tracking rules via tc ct action.
**CONFIG_MLX5_TC_SAMPLE=(y/n)**
| Support offloading sample rules via tc sample action.
**CONFIG_MLX5_VDPA=(y/n)**
| Support library for Mellanox VDPA drivers. Provides code that is
| common for all types of VDPA drivers. The following drivers are planned:
| net, block.
**CONFIG_MLX5_VDPA_NET=(y/n)**
| VDPA network driver for ConnectX6 and newer. Provides offloading
| of virtio net datapath such that descriptors put on the ring will
| be executed by the hardware. It also supports a variety of stateless
| offloads depending on the actual device used and firmware version.
**CONFIG_MLX5_VFIO_PCI=(y/n)**
| This provides migration support for MLX5 devices using the VFIO framework.
**External options** (Choose if the corresponding mlx5 feature is required)
- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled.
- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
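A quick way to verify a kernel `.config` against the options above is to parse it and check the stated dependencies. The Python sketch below covers only two facts from this page: CONFIG_MLX5_CORE must be y or m, and CONFIG_MLX5_CORE_IPOIB requires CONFIG_MLX5_CORE_EN. The helper names are hypothetical, and Kconfig itself remains the authority on dependencies.

```python
def parse_config(text):
    """Collect CONFIG_*=value pairs; '# CONFIG_X is not set' lines are skipped."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, _, value = line.partition("=")
            opts[key] = value
    return opts


def check_mlx5(opts):
    """Return a list of problems found for the mlx5 options described above."""
    problems = []
    if opts.get("CONFIG_MLX5_CORE") not in ("y", "m"):
        problems.append("CONFIG_MLX5_CORE is not enabled")
    if opts.get("CONFIG_MLX5_CORE_IPOIB") == "y" and opts.get("CONFIG_MLX5_CORE_EN") != "y":
        problems.append("CONFIG_MLX5_CORE_IPOIB requires CONFIG_MLX5_CORE_EN")
    return problems


if __name__ == "__main__":
    with open(".config") as f:  # assumed path: the kernel build directory
        for problem in check_mlx5(parse_config(f.read())):
            print(problem)
```

An empty result means neither of the two checked conditions is violated; it says nothing about the remaining options.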


@ -0,0 +1,239 @@
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
.. include:: <isonum.txt>
=========
Switchdev
=========
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. _mlx5_bridge_offload:
Bridge offload
==============
The mlx5 driver implements support for offloading bridge rules when in switchdev
mode. Linux bridge FDBs are automatically offloaded when an mlx5 switchdev
representor is attached to a bridge.
- Change device to switchdev mode::
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
$ ip link set enp8s0f0 master bridge1
VLANs
-----
The following bridge VLAN functions are supported by mlx5:
- VLAN filtering (including multiple VLANs per port)::
$ ip link set bridge1 type bridge vlan_filtering 1
$ bridge vlan add dev enp8s0f0 vid 2-3
- VLAN push on bridge ingress::
$ bridge vlan add dev enp8s0f0 vid 3 pvid
- VLAN pop on bridge egress::
$ bridge vlan add dev enp8s0f0 vid 3 untagged
Subfunction
===========
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
A subfunction has its own function capabilities and its own resources. This
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
queues are neither shared nor stolen from the parent PCI function.
When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
resources neither shared nor stolen from the parent PCI function.
A subfunction has a dedicated window in PCI BAR space that is not shared
with the other subfunctions or the parent PCI function. This ensures that all
devices (netdev, rdma, vdpa, etc.) of the subfunction access only the assigned
PCI BAR space.
A subfunction supports eswitch representation through which it supports tc
offloads. The user configures eswitch to send/receive packets from/to
the subfunction port.
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
other subfunctions and/or with their parent PCI function.
Example mlx5 software, system, and device view::
_______
| admin |
| user |----------
|_______| |
| |
____|____ __|______ _________________
| | | | | |
| devlink | | tc tool | | user |
| tool | |_________| | applications |
|_________| | |_________________|
| | | |
| | | | Userspace
+---------|-------------|-------------------|----------|--------------------+
| | +----------+ +----------+ Kernel
| | | netdev | | rdma dev |
| | +----------+ +----------+
(devlink port add/del | ^ ^
port function set) | | |
| | +---------------|
_____|___ | | _______|_______
| | | | | mlx5 class |
| devlink | +------------+ | | drivers |
| kernel | | rep netdev | | |(mlx5_core,ib) |
|_________| +------------+ | |_______________|
| | | ^
(devlink ops) | | (probe/remove)
_________|________ | | ____|________
| subfunction | | +---------------+ | subfunction |
| management driver|----- | subfunction |---| driver |
| (mlx5_core) | | auxiliary dev | | (mlx5_core) |
|__________________| +---------------+ |_____________|
| ^
(sf add/del, vhca events) |
| (device add/del)
_____|____ ____|________
| | | subfunction |
| PCI NIC |--- activate/deactivate events--->| host driver |
|__________| | (mlx5_core) |
|_____________|
A subfunction is created using the devlink port interface.
- Change device to switchdev mode::
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
- Add a devlink port of subfunction flavour::
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
- Show a devlink port of the subfunction::
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
- Delete a devlink port of subfunction after use::
$ devlink port del pci/0000:06:00.0/32768
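The port handle printed by `devlink port add` (here `pci/0000:06:00.0/32768`) is what the later show and del commands take. A script can capture it from the first output line, as in this Python sketch. `port_handle` and `del_cmd` are hypothetical helper names, and the parsing assumes the text layout shown above, where the handle is followed by a colon and a space.

```python
def port_handle(add_output):
    """Extract the devlink port handle from `devlink port add` text output."""
    lines = add_output.strip().splitlines()
    if not lines:
        raise ValueError("empty devlink output")
    # The handle itself contains colons, so split on the first ": " only.
    handle, sep, _ = lines[0].partition(": ")
    if not sep:
        raise ValueError(f"unexpected devlink output: {lines[0]!r}")
    return handle


def del_cmd(handle):
    """Build the argv that deletes the port once it is no longer needed."""
    return ["devlink", "port", "del", handle]


if __name__ == "__main__":
    out = ("pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf "
           "controller 0 pfnum 0 sfnum 88 external false splittable false")
    print(del_cmd(port_handle(out)))
```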
Function attributes
===================
The mlx5 driver provides a mechanism to set up PCI VF/SF function attributes in
a unified way for SmartNIC and non-SmartNIC.
This is supported only when the eswitch mode is set to switchdev. Port function
configuration of the PCI VF/SF is supported through devlink eswitch port.
Port function attributes should be set before PCI VF/SF is enumerated by the
driver.
MAC address setup
-----------------
The mlx5 driver supports the devlink port function attr mechanism to set up the MAC
address. (refer to Documentation/networking/devlink/devlink-port.rst)
RoCE capability setup
~~~~~~~~~~~~~~~~~~~~~
Not all mlx5 PCI devices/SFs require RoCE capability.
When RoCE capability is disabled, it saves 1 MB of system memory per
PCI device/SF.
The mlx5 driver supports the devlink port function attr mechanism to set up RoCE
capability. (refer to Documentation/networking/devlink/devlink-port.rst)
migratable capability setup
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Users who want mlx5 PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.
The mlx5 driver supports the devlink port function attr mechanism to set up migratable
capability. (refer to Documentation/networking/devlink/devlink-port.rst)
SF state setup
--------------
To use the SF, the user must activate the SF using the SF function state
attribute.
- Get the state of the SF identified by its unique devlink port index::
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state inactive opstate detached
- Activate the function and verify its state is active::
$ devlink port function set ens2f0npf0sf88 state active
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state active opstate detached
Upon function activation, the PF driver instance gets the event from the device
that a particular SF was activated. It's the cue to put the device on bus, probe
it and instantiate the devlink instance and class specific auxiliary devices
for it.
- Show the auxiliary device and port of the subfunction::
$ devlink dev show
devlink dev show auxiliary/mlx5_core.sf.4
$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
$ rdma link show mlx5_0/1
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
$ rdma dev show
8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
- Subfunction auxiliary device and class device hierarchy::
mlx5_core.sf.4
(subfunction auxiliary device)
/\
/ \
/ \
/ \
/ \
mlx5_core.eth.4 mlx5_core.rdma.4
(sf eth aux dev) (sf rdma aux dev)
| |
| |
p0sf88 mlx5_0
(sf netdev) (sf rdma device)
Additionally, the SF port also gets the event when the driver attaches to the
auxiliary device of the subfunction. This results in changing the operational
state of the function. This gives the user the visibility needed to decide when
it is safe to delete the SF port for graceful termination of the subfunction.
- Show the SF port operational state::
$ devlink port show ens2f0npf0sf88
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state active opstate attached
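The state and opstate words in the output above can also be read by a script. The following is a minimal sketch, assuming the text layout shown in the samples; the helper names ``parse_function_state`` and ``driver_detached`` are hypothetical, and treating ``opstate detached`` as the signal that the driver has released the subfunction is an interpretation of the paragraph above, not a statement about devlink internals.

```python
import re

# Sample text copied from the `devlink port show` output above.
SAMPLE = (
    "pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf "
    "controller 0 pfnum 0 sfnum 88 external false splittable false\n"
    "  function:\n"
    "    hw_addr 00:00:00:00:88:88 state active opstate attached\n"
)


def parse_function_state(output):
    """Return the 'state' and 'opstate' words found in the port description."""
    result = {}
    for key in ("state", "opstate"):
        # \b keeps 'state' from matching inside 'opstate'.
        match = re.search(r"\b" + key + r" (\w+)", output)
        if match:
            result[key] = match.group(1)
    return result


def driver_detached(output):
    """Hypothetical helper: True once no driver is attached to the SF."""
    return parse_function_state(output).get("opstate") == "detached"


print(parse_function_state(SAMPLE))
print(driver_detached(SAMPLE))
```

A script could poll the port with this kind of check after deactivating the function and delete the port only once the driver is reported as detached.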


@ -0,0 +1,229 @@
.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
.. include:: <isonum.txt>
===========
Tracepoints
===========
:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
mlx5 driver provides internal tracepoints for tracking and debugging using
kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
For the list of supported mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`.
tc and eswitch offloads tracepoints:
- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
$ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
$ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
- mlx5e_stats_flower: trace flower stats request::
$ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
$ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
$ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
Bridge offloads tracepoints:
- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
$ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
$ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
mlx5::
$ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
representor::
$ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
representor::
$ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
device::
$ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
device::
$ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
$ cat /sys/kernel/debug/tracing/trace
...
ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
Eswitch QoS tracepoints:
- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
$ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
$ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
$ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
$ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
$ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
$ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
<...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
SF tracepoints:
- mlx5_sf_add: trace addition of the SF port::
$ echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
- mlx5_sf_free: trace freeing of the SF port::
$ echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
- mlx5_sf_activate: trace activation of the SF port::
$ echo mlx5:mlx5_sf_activate >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-29841 [008] ..... 3669.635095: mlx5_sf_activate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
- mlx5_sf_deactivate: trace deactivation of the SF port::
$ echo mlx5:mlx5_sf_deactivate >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-29994 [008] ..... 4015.969467: mlx5_sf_deactivate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
$ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
$ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
$ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
- mlx5_sf_update_state: trace state updates for SF contexts::
$ echo mlx5:mlx5_sf_update_state >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u20:3-29490 [009] ..... 4141.453530: mlx5_sf_update_state: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 state=2
- mlx5_sf_vhca_event: trace SF vhca event and state::
$ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
- mlx5_sf_dev_add: trace SF device add event::
$ echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
- mlx5_sf_dev_del: trace SF device delete event::
$ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
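The trace lines shown above share a common layout (task-pid, CPU, flags, timestamp, event name, payload), so they can be post-processed with a short script. The following is a sketch only, assuming the layout of the sample lines in this document; the function name ``parse_trace_line`` and the returned dictionary keys are hypothetical, and events whose payload is not made of ``key=value`` pairs (for example the tc tracepoints) would need extra handling.

```python
import re

# Layout assumed from the sample lines: "<task>-<pid> [cpu] flags ts: event: payload"
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<timestamp>[\d.]+):\s+(?P<event>\w+):\s*(?P<payload>.*)$"
)


def parse_trace_line(line):
    """Split one trace line into its columns; return None if it does not match."""
    match = TRACE_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    record["cpu"] = int(record["cpu"])
    # Collect key=value pairs from the payload, keeping values as strings.
    record["fields"] = dict(re.findall(r"(\w+)=(\S+)", record["payload"]))
    return record


line = (
    "devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: "
    "(0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88"
)
print(parse_trace_line(line))
```

Values are left as strings on purpose, since fields such as ``hw_id`` are hexadecimal while others are decimal.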


@ -83,7 +83,7 @@ Configuring the Driver
MTU
---
Jumbo frame support is available with a maximim size of 9194 bytes.
Jumbo frame support is available with a maximum size of 9194 bytes.
Interrupt coalescing
--------------------


@ -124,7 +124,7 @@ Multicast flooding
==================
CPU port mcast_flooding is always on
Turning flooding on/off on swithch ports:
Turning flooding on/off on switch ports:
bridge link set dev sw0p1 mcast_flood on/off
Access and Trunk port


@ -174,7 +174,7 @@ Multicast flooding
==================
CPU port mcast_flooding is always on
Turning flooding on/off on swithch ports:
Turning flooding on/off on switch ports:
bridge link set dev sw0p1 mcast_flood on/off
Access and Trunk port


@ -33,7 +33,7 @@ Device driver can provide specific callbacks for each "health reporter", e.g.:
* Recovery procedures
* Diagnostics procedures
* Object dump procedures
* OOB initial parameters
* Out Of Box initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.
@ -46,12 +46,31 @@ Once an error is reported, devlink health will perform the following actions:
* A log is being sent to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
* Object dump is being taken and saved at the reporter instance (as long as
there is no other dump which is already stored)
auto-dump is set and there is no other dump which is already stored)
* Auto recovery attempt is being done. Depends on:
- Auto-recovery configuration
- Grace period vs. time passed since last recover
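The two conditions listed for auto recovery can be pictured with a small model. This is a simplified sketch, not the devlink implementation: the function name, its parameters and the millisecond units are all assumptions made for illustration, and the real code may take further state into account.

```python
def should_auto_recover(auto_recover, grace_period_ms, last_recovery_ms, now_ms):
    """Model of the decision: recover only if enabled and outside the grace period.

    last_recovery_ms is None when no recovery has happened yet.
    """
    if not auto_recover:
        # Auto-recovery is switched off in the reporter configuration.
        return False
    if last_recovery_ms is None:
        # Nothing to compare the grace period against.
        return True
    # Compare the time passed since the last recovery with the grace period.
    return now_ms - last_recovery_ms >= grace_period_ms


print(should_auto_recover(True, 500, 1000, 1200))   # still inside the grace period
print(should_auto_recover(True, 500, 1000, 1600))   # grace period has elapsed
```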
Devlink formatted message
=========================
To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's callback
to fill the data in using the devlink fmsg API.
Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
json-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.
The driver should use this API to fill the fmsg context in a format that devlink
later translates into a netlink message. When devlink sends the data to the
netlink layer using SKBs, it fragments the data across multiple SKBs. To make
this fragmentation possible, it uses virtual nest attributes instead of actual
netlink nesting, because an actual nest cannot be split between different
SKBs.
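To make the "json-like" nesting concrete, here is a toy model of a message built through start/put/end calls. It is a sketch of the idea only: the class ``FmsgModel`` and its method names are hypothetical and are not the kernel's devlink fmsg API, and the field names in the usage lines are invented sample data.

```python
import json


class FmsgModel:
    """Toy builder for a nested, JSON-like message (not the kernel API)."""

    def __init__(self):
        self.root = {}
        self._stack = [self.root]  # innermost open container is last

    def _attach(self, name, value):
        top = self._stack[-1]
        if isinstance(top, list):
            # Inside an array: bare values or single-pair objects are appended.
            top.append(value if name is None else {name: value})
        elif name is None:
            raise ValueError("a bare value is only valid inside an array")
        else:
            top[name] = value

    def pair_put(self, name, value):
        """Add a simple name/value attribute."""
        self._attach(name, value)

    def value_put(self, value):
        """Add one element to the currently open array."""
        self._attach(None, value)

    def obj_pair_start(self, name):
        """Open a nested object stored under 'name'."""
        obj = {}
        self._attach(name, obj)
        self._stack.append(obj)

    def arr_pair_start(self, name):
        """Open a value array stored under 'name'."""
        arr = []
        self._attach(name, arr)
        self._stack.append(arr)

    def nest_end(self):
        """Close the innermost open object or array."""
        if len(self._stack) == 1:
            raise ValueError("no open nest to close")
        self._stack.pop()


msg = FmsgModel()
msg.pair_put("reporter", "tx")
msg.obj_pair_start("queue")
msg.pair_put("index", 10)
msg.arr_pair_start("samples")
msg.value_put(1)
msg.value_put(2)
msg.nest_end()  # closes "samples"
msg.nest_end()  # closes "queue"
print(json.dumps(msg.root))
```

The start/end call pairs mirror why fragmentation is awkward with real nesting: a nest that is still open when one buffer fills up would have to continue in the next one.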
User Interface
==============


@ -285,7 +285,7 @@ features are enabled after the hierarchy is exported, but before any
changes are made.
This feature is also dependent on switchdev being enabled in the system.
It's required bacause devlink-rate requires devlink-port objects to be
It's required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.
If the driver is set to the switchdev mode, it will export internal
@ -320,7 +320,7 @@ nodes and nodes with children also can't be deleted.
* - ``tx_weight``
- allows for usage of Weighted Fair Queuing arbitration scheme among
siblings. This arbitration scheme can be used simultaneously with
the strict priority. Range 1-200. Only relative values mater for
the strict priority. Range 1-200. Only relative values matter for
arbitration.
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case


@ -66,3 +66,4 @@ parameters, info versions, and other features it supports.
prestera
iosm
octeontx2
sfc


@ -54,6 +54,24 @@ parameters.
- Control the number of large groups (size > 1) in the FDB table.
* The default value is 15, and the range is between 1 and 1024.
* - ``esw_multiport``
- Boolean
- runtime
- Control MultiPort E-Switch shared fdb mode.
An experimental mode where a single E-Switch is used and all the vports
and physical ports on the NIC are connected to it.
An example is to send traffic from a VF that is created on PF0 to an
uplink that is natively associated with the uplink of PF1.
Note: Future devices, ConnectX-8 and onward, will eventually have this
as the default to allow forwarding between all NIC ports in a single
E-switch environment and the dual E-switch mode will likely get
deprecated.
Default: disabled
The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``


@ -95,5 +95,5 @@ Driver-specific Traps
* - ``fid_miss``
- ``exception``
- When a packet enters the device it is classified to a filtering
indentifier (FID) based on the ingress port and VLAN. This trap is used
identifier (FID) based on the ingress port and VLAN. This trap is used
to trap packets for which a FID could not be found


@ -138,4 +138,4 @@ Driver-specific Traps
- Drops packets with zero (0) IPV4 source address.
* - ``met_red``
- ``drop``
- Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwith.
- Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.


@ -0,0 +1,57 @@
.. SPDX-License-Identifier: GPL-2.0
===================
sfc devlink support
===================
This document describes the devlink features implemented by the ``sfc``
device driver for the ef100 device.
Info versions
=============
The ``sfc`` driver reports the following versions
.. list-table:: devlink info versions implemented
:widths: 5 5 90
* - Name
- Type
- Description
* - ``fw.mgmt.suc``
- running
- For boards where the management function is split between multiple
control units, this is the SUC control unit's firmware version.
* - ``fw.mgmt.cmc``
- running
- For boards where the management function is split between multiple
control units, this is the CMC control unit's firmware version.
* - ``fpga.rev``
- running
- FPGA design revision.
* - ``fpga.app``
- running
- Datapath programmable logic version.
* - ``fw.app``
- running
- Datapath software/microcode/firmware version.
* - ``coproc.boot``
- running
- SmartNIC application co-processor (APU) first stage boot loader version.
* - ``coproc.uboot``
- running
- SmartNIC application co-processor (APU) co-operating system loader version.
* - ``coproc.main``
- running
- SmartNIC application co-processor (APU) main operating system version.
* - ``coproc.recovery``
- running
- SmartNIC application co-processor (APU) recovery operating system version.
* - ``fw.exprom``
- running
- Expansion ROM version. For boards where the expansion ROM is split between
multiple images (e.g. PXE and UEFI), this is specifically the PXE boot
ROM version.
* - ``fw.uefi``
- running
- UEFI driver version (No UNDI support).


@ -5,7 +5,7 @@ DSA switch configuration from userspace
=======================================
The DSA switch configuration is not integrated into the main userspace
network configuration suites by now and has to be performed manualy.
network configuration suites by now and has to be performed manually.
.. _dsa-config-showcases:


@ -106,7 +106,7 @@ modifying a bitmap, the former changes the bit set in mask to values set in
value and preserves the rest; the latter sets the bits set in the bitmap and
clears the rest.
Compact form: nested (bitset) atrribute contents:
Compact form: nested (bitset) attribute contents:
============================ ====== ============================
``ETHTOOL_A_BITSET_NOMASK`` flag no mask, only a list
@ -223,6 +223,8 @@ Userspace to kernel:
``ETHTOOL_MSG_PSE_SET`` set PSE parameters
``ETHTOOL_MSG_PSE_GET`` get PSE parameters
``ETHTOOL_MSG_RSS_GET`` get RSS settings
``ETHTOOL_MSG_MM_GET`` get MAC merge layer state
``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters
===================================== =================================
Kernel to userspace:
@ -265,6 +267,7 @@ Kernel to userspace:
``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings
``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@ -780,7 +783,7 @@ Kernel response contents:
``ETHTOOL_A_FEATURES_ACTIVE`` bitset diff old vs. new active
==================================== ====== ==========================
Request constains only one bitset which can be either value/mask pair (request
Request contains only one bitset which can be either value/mask pair (request
to change specific feature bits and leave the rest) or only a value (request
to set all features to specified set).
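The difference between the two request forms can be shown on plain integers. This is a minimal sketch of the value/mask semantics described above; the function name ``apply_bitset`` is hypothetical and integers stand in for the netlink bitset attribute.

```python
from typing import Optional


def apply_bitset(old: int, value: int, mask: Optional[int] = None) -> int:
    """Return the new bitmap after applying a request to the old one."""
    if mask is None:
        # Value only: the request replaces the whole set.
        return value
    # Value/mask pair: bits in mask take their state from value, the rest are kept.
    return (old & ~mask) | (value & mask)


print(bin(apply_bitset(0b1010, 0b0100, 0b0110)))  # change only the two middle bits
print(bin(apply_bitset(0b1010, 0b0100)))          # replace everything
```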
@ -871,6 +874,7 @@ Kernel response contents:
``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
==================================== ====== ===========================
``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
@ -880,8 +884,8 @@ separate buffers. The device configuration must make it possible to receive
full memory pages of data, for example because MTU is high enough or through
HW-GRO.
``ETHTOOL_A_RINGS_TX_PUSH`` flag is used to enable descriptor fast
path to send packets. In ordinary path, driver fills descriptors in DRAM and
``ETHTOOL_A_RINGS_[RX|TX]_PUSH`` flag is used to enable descriptor fast
path to send or receive packets. In ordinary path, driver fills descriptors in DRAM and
notifies NIC hardware. In fast path, driver pushes descriptors to the device
through MMIO writes, thus reducing the latency. However, enabling this feature
may increase the CPU cost. Drivers may enforce additional per-packet
@ -903,6 +907,7 @@ Request contents:
``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
==================================== ====== ===========================
Kernel checks that requested ring sizes do not exceed limits reported by
@ -1004,6 +1009,9 @@ Kernel response contents:
``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
=========================================== ====== =======================
Attributes are only included in reply if their value is not zero or the
@ -1022,6 +1030,17 @@ each packet event resets the timer. In this mode timer is used to force
the interrupt if queue goes idle, while busy queues depend on the packet
limit to trigger interrupts.
Tx aggregation consists of copying frames into a contiguous buffer so that they
can be submitted as a single IO operation. ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``
describes the maximum size in bytes for the submitted buffer.
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` describes the maximum number of frames
that can be aggregated into a single buffer.
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` describes the amount of time in usecs,
counted since the first packet arrival in an aggregated block, after which the
block should be sent.
This feature is mainly of interest for specific USB devices which do not cope
well with frequent small-sized URB transmissions.
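How the three limits interact can be illustrated with a small simulation. This is a sketch under stated assumptions, not driver code: the function ``aggregate`` is hypothetical, the timer is only evaluated when the next frame arrives (a real device would use a timer), and a frame larger than the byte limit is simply placed in a block of its own.

```python
def aggregate(frames, max_bytes, max_frames, time_usecs):
    """Group (arrival_us, size_bytes) frames into blocks under the three limits.

    frames must be sorted by arrival time. Returns a list of blocks,
    each block being a list of frame sizes.
    """
    blocks = []
    current = []          # sizes of the frames in the open block
    current_bytes = 0
    first_arrival = 0     # arrival time of the first frame in the open block

    for arrival, size in frames:
        timed_out = current and arrival - first_arrival >= time_usecs
        too_big = current and current_bytes + size > max_bytes
        if timed_out or too_big:
            # Submit the open block before accepting this frame.
            blocks.append(current)
            current, current_bytes = [], 0
        if not current:
            first_arrival = arrival
        current.append(size)
        current_bytes += size
        if len(current) >= max_frames:
            # Frame count limit reached: submit immediately.
            blocks.append(current)
            current, current_bytes = [], 0

    if current:
        blocks.append(current)
    return blocks


sample = [(0, 500), (10, 500), (20, 500), (300, 100), (310, 100)]
print(aggregate(sample, max_bytes=1200, max_frames=4, time_usecs=200))
```

In the sample run the third frame opens a new block because of the byte limit, and the fourth one does so because the time limit of the previous block has passed.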
COALESCE_SET
============
@ -1055,6 +1074,9 @@ Request contents:
``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
=========================================== ====== =======================
Request is rejected if it contains attributes declared as unsupported by driver (i.e.
@ -1072,8 +1094,18 @@ Request contents:
===================================== ====== ==========================
``ETHTOOL_A_PAUSE_HEADER`` nested request header
``ETHTOOL_A_PAUSE_STATS_SRC`` u32 source of statistics
===================================== ====== ==========================
``ETHTOOL_A_PAUSE_STATS_SRC`` is optional. It takes values from:
.. kernel-doc:: include/uapi/linux/ethtool.h
:identifiers: ethtool_mac_stats_src
If absent from the request, stats will be provided with
an ``ETHTOOL_A_PAUSE_STATS_SRC`` attribute in the response equal to
``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
Kernel response contents:
===================================== ====== ==========================
@ -1488,6 +1520,7 @@ Request contents:
======================================= ====== ==========================
``ETHTOOL_A_STATS_HEADER`` nested request header
``ETHTOOL_A_STATS_SRC`` u32 source of statistics
``ETHTOOL_A_STATS_GROUPS`` bitset requested groups of stats
======================================= ====== ==========================
@ -1496,6 +1529,8 @@ Kernel response contents:
+-----------------------------------+--------+--------------------------------+
| ``ETHTOOL_A_STATS_HEADER`` | nested | reply header |
+-----------------------------------+--------+--------------------------------+
| ``ETHTOOL_A_STATS_SRC`` | u32 | source of statistics |
+-----------------------------------+--------+--------------------------------+
| ``ETHTOOL_A_STATS_GRP`` | nested | one or more group of stats |
+-+---------------------------------+--------+--------------------------------+
| | ``ETHTOOL_A_STATS_GRP_ID`` | u32 | group ID - ``ETHTOOL_STATS_*`` |
@ -1557,6 +1592,11 @@ Low and high bounds are inclusive, for example:
etherStatsPkts512to1023Octets 512 1023
============================= ==== ====
``ETHTOOL_A_STATS_SRC`` is optional. Similar to ``PAUSE_GET``, it takes values
from ``enum ethtool_mac_stats_src``. If absent from the request, stats will be
provided with an ``ETHTOOL_A_STATS_SRC`` attribute in the response equal to
``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
PHC_VCLOCKS_GET
===============
@ -1716,6 +1756,225 @@ being used. Current supported options are toeplitz, xor or crc32.
ETHTOOL_A_RSS_INDIR attribute returns RSS indirection table where each byte
indicates queue number.
PLCA_GET_CFG
============
Gets the IEEE 802.3cg-2019 Clause 148 Physical Layer Collision Avoidance
(PLCA) Reconciliation Sublayer (RS) attributes.
Request contents:
===================================== ====== ==========================
``ETHTOOL_A_PLCA_HEADER`` nested request header
===================================== ====== ==========================
Kernel response contents:
======================================  ======  =============================
``ETHTOOL_A_PLCA_HEADER``               nested  reply header
``ETHTOOL_A_PLCA_VERSION``              u16     Supported PLCA management
                                                interface standard/version
``ETHTOOL_A_PLCA_ENABLED``              u8      PLCA Admin State
``ETHTOOL_A_PLCA_NODE_ID``              u32     PLCA unique local node ID
``ETHTOOL_A_PLCA_NODE_CNT``             u32     Number of PLCA nodes on the
                                                network, including the
                                                coordinator
``ETHTOOL_A_PLCA_TO_TMR``               u32     Transmit Opportunity Timer
                                                value in bit-times (BT)
``ETHTOOL_A_PLCA_BURST_CNT``            u32     Number of additional packets
                                                the node is allowed to send
                                                within a single TO
``ETHTOOL_A_PLCA_BURST_TMR``            u32     Time to wait for the MAC to
                                                transmit a new frame before
                                                terminating the burst
======================================  ======  =============================
When set, the optional ``ETHTOOL_A_PLCA_VERSION`` attribute indicates which
standard and version the PLCA management interface complies with. When not set,
the interface is vendor-specific and (possibly) supplied by the driver.
The OPEN Alliance SIG specifies a standard register map for 10BASE-T1S PHYs
embedding the PLCA Reconciliation Sublayer. See "10BASE-T1S PLCA Management
Registers" at https://www.opensig.org/about/specifications/.
When set, the optional ``ETHTOOL_A_PLCA_ENABLED`` attribute indicates the
administrative state of the PLCA RS. When not set, the node operates in "plain"
CSMA/CD mode. This option corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.1
aPLCAAdminState / 30.16.1.2.1 acPLCAAdminControl.
When set, the optional ``ETHTOOL_A_PLCA_NODE_ID`` attribute indicates the
configured local node ID of the PHY. This ID determines which transmit
opportunity (TO) is reserved for the node to transmit into. This option
corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.4 aPLCALocalNodeID. The valid
range for this attribute is [0 .. 255] where 255 means "not configured".
When set, the optional ``ETHTOOL_A_PLCA_NODE_CNT`` attribute indicates the
configured maximum number of PLCA nodes on the mixing-segment. This number
determines the total number of transmit opportunities generated during a
PLCA cycle. This attribute is relevant only for the PLCA coordinator, which is
the node with aPLCALocalNodeID set to 0. Follower nodes ignore this setting.
This option corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.3
aPLCANodeCount. The valid range for this attribute is [1 .. 255].
When set, the optional ``ETHTOOL_A_PLCA_TO_TMR`` attribute indicates the
configured value of the transmit opportunity timer in bit-times. This value
must be set equal across all nodes sharing the medium for PLCA to work
correctly. This option corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.5
aPLCATransmitOpportunityTimer. The valid range for this attribute is
[0 .. 255].
When set, the optional ``ETHTOOL_A_PLCA_BURST_CNT`` attribute indicates the
configured number of extra packets that the node is allowed to send during a
single transmit opportunity. By default, this attribute is 0, meaning that
the node can only send a single frame per TO. When greater than 0, the PLCA RS
keeps the TO after any transmission, waiting for the MAC to send a new frame
for up to aPLCABurstTimer BTs. This can only happen a number of times per PLCA
cycle up to the value of this parameter. After that, the burst is over and the
normal counting of TOs resumes. This option corresponds to
``IEEE 802.3cg-2019`` 30.16.1.1.6 aPLCAMaxBurstCount. The valid range for this
attribute is [0 .. 255].
When set, the optional ``ETHTOOL_A_PLCA_BURST_TMR`` attribute indicates how
many bit-times the PLCA RS waits for the MAC to initiate a new transmission
when aPLCAMaxBurstCount is greater than 0. If the MAC fails to send a new
frame within this time, the burst ends and the counting of TOs resumes.
Otherwise, the new frame is sent as part of the current burst. This option
corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.7 aPLCABurstTimer. The
valid range for this attribute is [0 .. 255]. However, the value should be
set greater than the Inter-Frame-Gap (IFG) time of the MAC (plus some margin)
for PLCA burst mode to work as intended.
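The valid ranges listed above lend themselves to a simple pre-check before a configuration is sent. This is a sketch only: the ``PlcaCfg`` class, the ``validate`` and ``is_coordinator`` helpers and the sample values are hypothetical, and the ranges are copied from the attribute descriptions in this section rather than from the kernel's netlink policy.

```python
from dataclasses import dataclass
from typing import List

NODE_ID_NOT_CONFIGURED = 255  # per the NODE_ID description above


@dataclass
class PlcaCfg:
    node_id: int
    node_cnt: int
    to_tmr: int
    burst_cnt: int
    burst_tmr: int


# (field name, lowest valid value, highest valid value), from the text above
RANGES = [
    ("node_id", 0, 255),
    ("node_cnt", 1, 255),
    ("to_tmr", 0, 255),
    ("burst_cnt", 0, 255),
    ("burst_tmr", 0, 255),
]


def validate(cfg: PlcaCfg) -> List[str]:
    """Return a list of human-readable range violations (empty if none)."""
    errors = []
    for name, low, high in RANGES:
        value = getattr(cfg, name)
        if not low <= value <= high:
            errors.append(f"{name}={value} outside [{low} .. {high}]")
    return errors


def is_coordinator(cfg: PlcaCfg) -> bool:
    """The coordinator is the node whose local node ID is 0."""
    return cfg.node_id == 0


cfg = PlcaCfg(node_id=0, node_cnt=8, to_tmr=32, burst_cnt=0, burst_tmr=128)
print(validate(cfg), is_coordinator(cfg))
```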
PLCA_SET_CFG
============
Sets PLCA RS parameters.
Request contents:
======================================  ======  =============================
``ETHTOOL_A_PLCA_HEADER``               nested  request header
``ETHTOOL_A_PLCA_ENABLED``              u8      PLCA Admin State
``ETHTOOL_A_PLCA_NODE_ID``              u8      PLCA unique local node ID
``ETHTOOL_A_PLCA_NODE_CNT``             u8      Number of PLCA nodes on the
                                                network, including the
                                                coordinator
``ETHTOOL_A_PLCA_TO_TMR``               u8      Transmit Opportunity Timer
                                                value in bit-times (BT)
``ETHTOOL_A_PLCA_BURST_CNT``            u8      Number of additional packets
                                                the node is allowed to send
                                                within a single TO
``ETHTOOL_A_PLCA_BURST_TMR``            u8      Time to wait for the MAC to
                                                transmit a new frame before
                                                terminating the burst
======================================  ======  =============================
For a description of each attribute, see ``PLCA_GET_CFG``.
PLCA_GET_STATUS
===============
Gets PLCA RS status information.
Request contents:
===================================== ====== ==========================
``ETHTOOL_A_PLCA_HEADER`` nested request header
===================================== ====== ==========================
Kernel response contents:
====================================== ====== =============================
``ETHTOOL_A_PLCA_HEADER`` nested reply header
``ETHTOOL_A_PLCA_STATUS`` u8 PLCA RS operational status
====================================== ====== =============================
When set, the ``ETHTOOL_A_PLCA_STATUS`` attribute indicates whether the node is
detecting the presence of the BEACON on the network. This flag
corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.2 aPLCAStatus.
MM_GET
======
Retrieve 802.3 MAC Merge parameters.
Request contents:
==================================== ====== ==========================
``ETHTOOL_A_MM_HEADER`` nested request header
==================================== ====== ==========================
Kernel response contents:
=================================  ======  ===================================
``ETHTOOL_A_MM_HEADER``            nested  reply header
``ETHTOOL_A_MM_PMAC_ENABLED``      bool    set if RX of preemptible and SMD-V
                                           frames is enabled
``ETHTOOL_A_MM_TX_ENABLED``        bool    set if TX of preemptible frames is
                                           administratively enabled (might be
                                           inactive if verification failed)
``ETHTOOL_A_MM_TX_ACTIVE``         bool    set if TX of preemptible frames is
                                           operationally enabled
``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE``  u32     minimum size of transmitted
                                           non-final fragments, in octets
``ETHTOOL_A_MM_RX_MIN_FRAG_SIZE``  u32     minimum size of received non-final
                                           fragments, in octets
``ETHTOOL_A_MM_VERIFY_ENABLED``    bool    set if TX of SMD-V frames is
                                           administratively enabled
``ETHTOOL_A_MM_VERIFY_STATUS``     u8      state of the verification function
``ETHTOOL_A_MM_VERIFY_TIME``       u32     delay between verification attempts
``ETHTOOL_A_MM_MAX_VERIFY_TIME``   u32     maximum verification interval
                                           supported by device
``ETHTOOL_A_MM_STATS``             nested  IEEE 802.3-2018 subclause 30.14.1
                                           oMACMergeEntity statistics counters
=================================  ======  ===================================
The attributes are populated by the device driver through the following
structure:
.. kernel-doc:: include/linux/ethtool.h
:identifiers: ethtool_mm_state
The ``ETHTOOL_A_MM_VERIFY_STATUS`` will report one of the values from
.. kernel-doc:: include/uapi/linux/ethtool.h
    :identifiers: ethtool_mm_verify_status
If ``ETHTOOL_A_MM_VERIFY_ENABLED`` was passed as false in the ``MM_SET``
command, ``ETHTOOL_A_MM_VERIFY_STATUS`` will report either
``ETHTOOL_MM_VERIFY_STATUS_INITIAL`` or ``ETHTOOL_MM_VERIFY_STATUS_DISABLED``,
otherwise it should report one of the other states.
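The rule above can be sketched as a small user-space consistency check. This is
only an illustration: the enum below is a local mirror of
``enum ethtool_mm_verify_status`` (its ordering is an assumption, and real code
should include the uAPI header instead), and ``mm_verify_status_plausible()`` is
a hypothetical helper, not a kernel or ethtool API.

```c
#include <stdbool.h>

/*
 * Local mirror of enum ethtool_mm_verify_status, repeated here only so the
 * sketch builds without recent kernel headers. The numeric ordering is an
 * assumption; include <linux/ethtool.h> in real code.
 */
enum mm_verify_status {
	MM_VERIFY_STATUS_UNKNOWN,
	MM_VERIFY_STATUS_INITIAL,
	MM_VERIFY_STATUS_VERIFYING,
	MM_VERIFY_STATUS_SUCCEEDED,
	MM_VERIFY_STATUS_FAILED,
	MM_VERIFY_STATUS_DISABLED,
};

/*
 * Hypothetical helper: returns true if the reported status is consistent
 * with the verify_enabled value that was last configured via MM_SET.
 *
 * - verification disabled: only INITIAL or DISABLED are expected
 * - verification enabled:  any of the other states is expected
 */
bool mm_verify_status_plausible(bool verify_enabled,
				enum mm_verify_status status)
{
	bool idle = status == MM_VERIFY_STATUS_INITIAL ||
		    status == MM_VERIFY_STATUS_DISABLED;

	return verify_enabled ? !idle : idle;
}
```

A monitoring tool could, for example, log a warning when this check fails
rather than treating the mismatch as fatal, since the text above describes
expected rather than guaranteed driver behaviour.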
It is recommended that drivers start with the pMAC disabled, and enable it upon
user space request. It is also recommended that user space does not depend upon
the default values from ``ETHTOOL_MSG_MM_GET`` requests.
``ETHTOOL_A_MM_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
``ETHTOOL_A_HEADER_FLAGS``. The attribute will be empty if the driver did not
report any statistics. Drivers fill in the statistics in the following
structure:
.. kernel-doc:: include/linux/ethtool.h
    :identifiers: ethtool_mm_stats
MM_SET
======
Modifies the configuration of the 802.3 MAC Merge layer.
Request contents:
=================================  ======  ==========================
``ETHTOOL_A_MM_VERIFY_TIME``       u32     see MM_GET description
``ETHTOOL_A_MM_VERIFY_ENABLED``    bool    see MM_GET description
``ETHTOOL_A_MM_TX_ENABLED``        bool    see MM_GET description
``ETHTOOL_A_MM_PMAC_ENABLED``      bool    see MM_GET description
``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE``  u32     see MM_GET description
=================================  ======  ==========================
The attributes are propagated to the driver through the following structure:
.. kernel-doc:: include/linux/ethtool.h
    :identifiers: ethtool_mm_cfg
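As an illustration of what values ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE`` can
meaningfully take, the sketch below assumes the IEEE 802.3 ``addFragSize``
encoding, in which the minimum non-final fragment size is
``64 * (1 + addFragSize) - 4`` octets for ``addFragSize`` in the range 0..3.
The function names are hypothetical and chosen for this example only. The
strict rejection of in-between sizes is a design choice of the sketch, not a
statement about how the kernel or any driver handles such values.

```c
#include <stdint.h>

#define MM_MIN_FRAME_LEN	60u	/* minimum frame length without FCS */
#define MM_FCS_LEN		4u	/* frame check sequence length */
#define MM_ADD_FRAG_SIZE_MAX	3u	/* addFragSize is a 2-bit value */

/* Convert an addFragSize value (0..3) to a minimum fragment size in octets. */
uint32_t mm_add_frag_size_to_min(uint32_t add_frag_size)
{
	return (MM_MIN_FRAME_LEN + MM_FCS_LEN) * (1 + add_frag_size) -
	       MM_FCS_LEN;
}

/*
 * Convert a minimum fragment size in octets back to addFragSize.
 * Returns 0 and stores the result on success, or -1 if the size is not
 * exactly representable (this sketch deliberately does not round).
 */
int mm_min_frag_size_to_add(uint32_t min_frag_size, uint32_t *add_frag_size)
{
	uint32_t add;

	for (add = 0; add <= MM_ADD_FRAG_SIZE_MAX; add++) {
		if (mm_add_frag_size_to_min(add) == min_frag_size) {
			*add_frag_size = add;
			return 0;
		}
	}

	return -1;
}
```

Under this assumption the representable sizes are 60, 124, 188 and 252 octets,
so a user space tool could validate its input before sending ``MM_SET``.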
Request translation
===================
@ -1817,4 +2076,9 @@ are netlink only.
n/a ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
n/a ``ETHTOOL_MSG_MODULE_GET``
n/a ``ETHTOOL_MSG_MODULE_SET``
n/a ``ETHTOOL_MSG_PLCA_GET_CFG``
n/a ``ETHTOOL_MSG_PLCA_SET_CFG``
n/a ``ETHTOOL_MSG_PLCA_GET_STATUS``
n/a ``ETHTOOL_MSG_MM_GET``
n/a ``ETHTOOL_MSG_MM_SET``
=================================== =====================================


@ -162,7 +162,7 @@ Local GTP-U entity and tunnel identification
GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2152
for GTPv1-U and 3386 for GTPv0-U.
There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW
There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW
instance) per IP address. Tunnel Endpoint Identifiers (TEIDs) are unique
per GTP-U entity.
